An Interactive Geospatial and Economic Analysis of the City of Melbourne for Business Insights
Authored by: Naga Nikhil Woopalanchi
Duration: 120 mins
Level: Advanced
Prerequisite Skills: Python, Machine Learning, Deep Learning (TensorFlow), Predictive Analysis
Scenario

The City of Melbourne is seeking to develop a comprehensive tool that can aid urban planners, policymakers, and business owners in fostering economic growth and supporting local businesses. To achieve this, they require a deep understanding of the city's business landscape, including the location, industry classification, and job creation of businesses within the city.

This use case focuses on creating an interactive platform that integrates multiple datasets to provide a holistic view of the business environment in the City of Melbourne. The core components of this project include street data providing the geographic coordinates of property addresses; business establishment data covering the period 2002 to 2022, with information on business address, industry classification (ANZSIC4), location, CLUE small area, and CLUE industry classification; and business establishments and jobs data offering insights into the number of jobs and business establishments.

By the end of this use case you will have covered:

  • Data Integration: Merging and integrating diverse datasets using common identifiers.
  • Data Cleaning: Handling missing values, correcting inconsistencies, and preparing data for analysis.
  • Exploratory Data Analysis: Uncovering patterns, trends, and relationships through visualizations.
  • Geospatial Analysis: Mapping business locations and analysing spatial distributions using GIS.
  • Data Modeling: Building predictive models for economic and urban analysis.
  • Industry Insights: Understanding the economic structure of the City of Melbourne by industry and size.
  • Urban Planning & Economic Development: Analyzing the link between activity, urban planning, and growth.
  • Project Management: Managing a data science project from data acquisition to presenting results.
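The data-integration objective above boils down to joining tables on shared identifiers. A minimal sketch with pandas, using two hypothetical miniature frames keyed on a shared business address (the column names mirror the real datasets; the rows are invented for illustration):

```python
import pandas as pd

# Hypothetical miniature versions of the real datasets, keyed on a
# shared business address, to illustrate the integration step.
establishments = pd.DataFrame({
    'business_address': ['1 Collins Street', '2 Bourke Street'],
    'industry_anzsic4_description': ['Cafes and Restaurants', 'Accounting Services'],
})
addresses = pd.DataFrame({
    'business_address': ['1 Collins Street', '2 Bourke Street'],
    'latitude': [-37.8136, -37.8142],
    'longitude': [144.9631, 144.9632],
})

# A left join keeps every establishment and attaches coordinates where a match exists
merged = establishments.merge(addresses, on='business_address', how='left')
print(merged.shape)  # (2, 4)
```

A left join is the safer default here: establishments without a matching address survive the merge (with NaN coordinates) instead of silently disappearing.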
Description of the problem

As an urban planner, policymaker, or business owner in the City of Melbourne, I want to have a comprehensive understanding of the city's business landscape, so that I can make informed decisions, support existing businesses, and foster economic growth in the region.

Specifically, I need to be able to:

  1. Understand the geographical distribution of different businesses in the City of Melbourne, including the location and clustering of different industries.
  2. Analyze the industry composition of the business community, including the number and size of establishments, as well as the number of jobs provided by each industry.
  3. Identify potential areas for business development and growth, based on factors such as industry concentration, job creation, and emerging trends.
  4. Support local businesses by providing insights into the competitive landscape, market opportunities, and potential areas for collaboration or resource sharing.
  5. Inform policymaking and urban planning decisions, such as infrastructure investments, zoning regulations, and economic development strategies, to create a more conducive environment for businesses to thrive.

By having access to a comprehensive and integrated platform that combines various datasets on the city's businesses, I will be able to make data-driven decisions, identify areas of opportunity, and implement policies and initiatives that foster a vibrant and prosperous business community in the City of Melbourne.

Introduction

The City of Melbourne is seeking to develop a comprehensive tool that can aid urban planners, policymakers, and business owners in fostering economic growth and supporting local businesses. To achieve this, they require a deep understanding of the city's business landscape, including the location, industry classification, and job creation of businesses within the city.

This project aims to create a robust and interactive platform that integrates multiple datasets to provide a holistic view of the business environment in the City of Melbourne. The key components of the project include street data, business establishment data from 2002 to 2022, and job creation data. By combining these datasets, the platform provides urban planners, policymakers, and business owners with a powerful tool that can support informed decision-making, foster economic growth, and enhance the overall business environment within the City of Melbourne.

The datasets required for this project include:

  1. Street data providing the geographic coordinates of property addresses.
  2. Business establishment data covering the period from 2002 to 2022 with information on business address, industry classification (ANZSIC4), and CLUE small area location classification.
  3. Business establishments and jobs data offering insights into the number of jobs and business establishments categorised by business size, CLUE industry, ANZSIC1 and CLUE small area allocation.

This project involves a series of steps, including data cleaning, preprocessing, exploratory data analysis, and predictive modeling. By leveraging these techniques, the project aims to uncover patterns, trends, and relationships within the City of Melbourne's business landscape, ultimately aiding informed decision-making, fostering economic growth, and enhancing the overall business environment.

Importing Libraries

Importing the required Python modules

In [3]:
import warnings
warnings.filterwarnings("ignore")

import requests
import numpy as np
import pandas as pd
import seaborn as sns
from io import StringIO
import matplotlib.pyplot as plt
import geopandas as gpd
from shapely.geometry import Point
import holoviews as hv
import folium
from folium.plugins import MarkerCluster, HeatMap
from wordcloud import WordCloud
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import classification_report, confusion_matrix, mean_squared_error, mean_absolute_error, r2_score
import plotly.express as px
from sklearn.linear_model import LogisticRegression, LinearRegression
from shapely import wkt
from IPython.display import display, HTML
from scipy.spatial import distance
from scipy.stats import gaussian_kde
import contextily as ctx
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.api import VAR
from math import sqrt
from pmdarima import auto_arima
from statsmodels.tsa.stattools import adfuller
import tensorflow as tf
from tensorflow.keras.models import Sequential 
from tensorflow.keras.layers import Dense
In [4]:
# Street address

base_url='https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'
dataset_id='street-addresses'


url=f'{base_url}{dataset_id}/exports/csv'
params={'select':'*','limit':-1,'lang':'en','timezone':'UTC'}

response=requests.get(url,params=params)

if response.status_code==200:
    url_content=response.content.decode('utf-8')
    street_add=pd.read_csv(StringIO(url_content),delimiter=';')  # load into a DataFrame named street_add
    print(street_add.head(10))
else:
    print(f'Request failed with status code {response.status_code}')
                         geo_point_2d  \
0  -37.802381557572, 144.941473440919   
1  -37.816860132435, 144.969991449806   
2  -37.798830275265, 144.942872100233   
3  -37.810546771396, 144.970906397029   
4  -37.789293657657, 144.939794028368   
5  -37.800782747366, 144.951363910142   
6  -37.826775193662, 144.959358160779   
7  -37.810925617843, 144.965443591832   
8   -37.81275302482, 144.964263891172   
9   -37.821209833971, 144.95403377339   

                                           geo_shape  suburb_id   latitude  \
0  {"coordinates": [144.941473440919, -37.8023815...      592.0 -37.802382   
1  {"coordinates": [144.969991449806, -37.8168601...      591.0 -37.816860   
2  {"coordinates": [144.942872100233, -37.7988302...      592.0 -37.798830   
3  {"coordinates": [144.970906397029, -37.8105467...      591.0 -37.810547   
4  {"coordinates": [144.939794028368, -37.7892936...      592.0 -37.789294   
5  {"coordinates": [144.951363910142, -37.8007827...      592.0 -37.800783   
6  {"coordinates": [144.959358160779, -37.8267751...      596.0 -37.826775   
7  {"coordinates": [144.965443591832, -37.8109256...      591.0 -37.810926   
8  {"coordinates": [144.964263891172, -37.8127530...      591.0 -37.812753   
9  {"coordinates": [144.95403377339, -37.82120983...      599.0 -37.821210   

  street_no              str_name                          address_pnt  \
0       133        Laurens Street  133 Laurens Street  North Melbourne   
1       129       Flinders Street       129 Flinders Street  Melbourne   
2        44         Macaulay Road    44 Macaulay Road  North Melbourne   
3        13            Punch Lane             13 Punch Lane  Melbourne   
4        61       Racecourse Road  61 Racecourse Road  North Melbourne   
5       133        Leveson Street  133 Leveson Street  North Melbourne   
6       263             City Road             263 City Road  Southbank   
7         3    Albert Coates Lane      3 Albert Coates Lane  Melbourne   
8       280  Little Bourke Street  280 Little Bourke Street  Melbourne   
9       612       Flinders Street       612 Flinders Street  Docklands   

         easting      northing  gisid   longitude           suburb  street_id  \
0  318773.161972  5.814115e+06  48531  144.941473  North Melbourne        781   
1  321318.983104  5.812563e+06  37711  144.969991        Melbourne        636   
2  318887.633593  5.814512e+06  30476  144.942872  North Melbourne        847   
3  321384.307636  5.813265e+06  35165  144.970906        Melbourne       1003   
4  318593.277492  5.815564e+06  22247  144.939794  North Melbourne     119624   
5  319640.094405  5.814311e+06  54777  144.951364  North Melbourne        791   
6  320406.983288  5.811443e+06  46380  144.959358        Southbank       1345   
7  320904.296420  5.813213e+06  14049  144.965444        Melbourne     117923   
8  320804.859001  5.813008e+06  28925  144.964264        Melbourne        809   
9  319924.815030  5.812050e+06   1278  144.954034        Docklands        636   

  add_comp  
0      NaN  
1      NaN  
2      NaN  
3      NaN  
4      NaN  
5      NaN  
6      NaN  
7      NaN  
8      NaN  
9      NaN  
In [5]:
# Business establishments and jobs data by business size and industry

base_url='https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'
dataset_id='business-establishments-and-jobs-data-by-business-size-and-industry'


url=f'{base_url}{dataset_id}/exports/csv'
params={'select':'*','limit':-1,'lang':'en','timezone':'UTC' }

response=requests.get(url,params=params)

if response.status_code==200:
    url_content=response.content.decode('utf-8')
    # Load into a DataFrame named industrysize_jobs
    industrysize_jobs =pd.read_csv(StringIO(url_content),delimiter=';')   
    print(industrysize_jobs.head(10))
else:
    print(f'Request failed with status code {response.status_code}')
   census_year               clue_small_area  \
0         2015  West Melbourne (Residential)   
1         2015  West Melbourne (Residential)   
2         2015  West Melbourne (Residential)   
3         2015  West Melbourne (Residential)   
4         2015  West Melbourne (Residential)   
5         2015  West Melbourne (Residential)   
6         2014                       Carlton   
7         2014                       Carlton   
8         2014                       Carlton   
9         2014                       Carlton   

                                   anzsic_indusrty  \
0                Health Care and Social Assistance   
1                                    Manufacturing   
2                                    Manufacturing   
3  Professional, Scientific and Technical Services   
4          Rental, Hiring and Real Estate Services   
5                                  Wholesale Trade   
6              Administrative and Support Services   
7                                     Construction   
8                 Financial and Insurance Services   
9                 Financial and Insurance Services   

                       clue_industry    business_size  total_establishments  \
0  Health Care and Social Assistance   Large business                     1   
1                      Manufacturing  Medium business                     5   
2                      Manufacturing    Non employing                     1   
3                  Business Services    Non employing                     3   
4               Real Estate Services   Small business                     5   
5                    Wholesale Trade    Non employing                     2   
6         Admin and Support Services  Medium business                     7   
7                       Construction   Small business                    11   
8              Finance and Insurance  Medium business                     5   
9              Finance and Insurance   Small business                    35   

   total_jobs  
0         NaN  
1       171.0  
2         NaN  
3         0.0  
4        42.0  
5         NaN  
6       209.0  
7        55.0  
8       315.0  
9       216.0  
In [6]:
# Business establishments with address and industry classification

base_url='https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'
dataset_id='business-establishments-with-address-and-industry-classification'


url=f'{base_url}{dataset_id}/exports/csv'
params={'select':'*','limit':-1,'lang':'en','timezone':'UTC'}

response=requests.get(url,params=params)

if response.status_code==200:
    url_content=response.content.decode('utf-8')
    # Load into a DataFrame named industryinfo_classification
    industryinfo_classification=pd.read_csv(StringIO(url_content),delimiter=';')    
    print(industryinfo_classification.head(10))
else:
    print(f'Request failed with status code {response.status_code}')
   census_year  block_id  property_id  base_property_id clue_small_area  \
0         2017       266       109851            109851         Carlton   
1         2017       266       109851            109851         Carlton   
2         2017       266       534003            534003         Carlton   
3         2017       266       664003            664003         Carlton   
4         2017       266       664005            664005         Carlton   
5         2017       266       664009            664009         Carlton   
6         2017       267       102614            102614         Carlton   
7         2017       267       102614            102614         Carlton   
8         2017       267       106251            106251         Carlton   
9         2017       267       106251            106251         Carlton   

                   trading_name  \
0  Metropoli's Research Pty Ltd   
1             J Hong Restaurant   
2                  St2 Expresso   
3            RMIT Resources Ltd   
4                        vacant   
5    RMIT Resources Ltd Bldg 42   
6                 J Trioli & Co   
7                  Marie August   
8           Downtowner On Lygon   
9   Mellow's Restaurant and Bar   

                                business_address  industry_anzsic4_code  \
0       Level 1, 74 Victoria Street CARLTON 3053                   6950   
1       Ground , 74 Victoria Street CARLTON 3053                   4511   
2                70 Victoria Street CARLTON 3053                   4512   
3                20 Cardigan Street CARLTON 3053                   8102   
4                24 Cardigan Street CARLTON 3053                      0   
5                 36-36 Earl Street CARLTON 3053                   8102   
6       Level 1, 33 Drummond Street CARLTON 3053                   6932   
7  Part Ground , 33 Drummond Street CARLTON 3053                   8512   
8                   66 Lygon Street CARLTON 3053                   4400   
9     Part Ground , 66 Lygon Street CARLTON 3053                   4511   

               industry_anzsic4_description   longitude   latitude  
0  Market Research and Statistical Services  144.965352 -37.806701  
1                     Cafes and Restaurants  144.965352 -37.806701  
2                    Takeaway Food Services  144.965473 -37.806714  
3                          Higher Education  144.964753 -37.806312  
4                              Vacant Space  144.964772 -37.806203  
5                          Higher Education  144.964974 -37.805887  
6                       Accounting Services  144.967047 -37.806081  
7               Specialist Medical Services  144.967047 -37.806081  
8                             Accommodation  144.966575 -37.805303  
9                     Cafes and Restaurants  144.966575 -37.805303  
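The three download cells above repeat the same request/decode/parse pattern. It can be factored into a small helper; this is a sketch, and the function name `fetch_melbourne_dataset` is our own, not part of the API:

```python
import requests
import pandas as pd
from io import StringIO

BASE_URL = 'https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'

def fetch_melbourne_dataset(dataset_id: str) -> pd.DataFrame:
    """Download a City of Melbourne open-data CSV export as a DataFrame.

    Raises an HTTPError on a non-2xx response instead of printing, so
    callers can decide how to handle failures.
    """
    url = f'{BASE_URL}{dataset_id}/exports/csv'
    params = {'select': '*', 'limit': -1, 'lang': 'en', 'timezone': 'UTC'}
    response = requests.get(url, params=params)
    response.raise_for_status()
    return pd.read_csv(StringIO(response.content.decode('utf-8')), delimiter=';')

# Example usage (network call, left commented out):
# street_add = fetch_melbourne_dataset('street-addresses')
```

Centralising the request logic means a change to the endpoint or parameters is made in one place rather than three.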

Data Snippets

To understand and gain familiarity with our data, we first examine the structure of each dataset by checking its number of rows and columns. This helps us gauge the size and complexity of the data, reveals how the different variables are stored and their formats, and highlights which columns may require further cleaning or transformation before analysis.

In [7]:
#Street address data set

street_add.head()
Out[7]:
geo_point_2d geo_shape suburb_id latitude street_no str_name address_pnt easting northing gisid longitude suburb street_id add_comp
0 -37.802381557572, 144.941473440919 {"coordinates": [144.941473440919, -37.8023815... 592.0 -37.802382 133 Laurens Street 133 Laurens Street North Melbourne 318773.161972 5.814115e+06 48531 144.941473 North Melbourne 781 NaN
1 -37.816860132435, 144.969991449806 {"coordinates": [144.969991449806, -37.8168601... 591.0 -37.816860 129 Flinders Street 129 Flinders Street Melbourne 321318.983104 5.812563e+06 37711 144.969991 Melbourne 636 NaN
2 -37.798830275265, 144.942872100233 {"coordinates": [144.942872100233, -37.7988302... 592.0 -37.798830 44 Macaulay Road 44 Macaulay Road North Melbourne 318887.633593 5.814512e+06 30476 144.942872 North Melbourne 847 NaN
3 -37.810546771396, 144.970906397029 {"coordinates": [144.970906397029, -37.8105467... 591.0 -37.810547 13 Punch Lane 13 Punch Lane Melbourne 321384.307636 5.813265e+06 35165 144.970906 Melbourne 1003 NaN
4 -37.789293657657, 144.939794028368 {"coordinates": [144.939794028368, -37.7892936... 592.0 -37.789294 61 Racecourse Road 61 Racecourse Road North Melbourne 318593.277492 5.815564e+06 22247 144.939794 North Melbourne 119624 NaN
In [8]:
# Street addressess dataset datatype information

street_add.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63721 entries, 0 to 63720
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   geo_point_2d  63721 non-null  object 
 1   geo_shape     63721 non-null  object 
 2   suburb_id     63715 non-null  float64
 3   latitude      63721 non-null  float64
 4   street_no     63069 non-null  object 
 5   str_name      63720 non-null  object 
 6   address_pnt   63721 non-null  object 
 7   easting       63721 non-null  float64
 8   northing      63721 non-null  float64
 9   gisid         63721 non-null  int64  
 10  longitude     63721 non-null  float64
 11  suburb        62991 non-null  object 
 12  street_id     63721 non-null  int64  
 13  add_comp      1352 non-null   object 
dtypes: float64(5), int64(2), object(7)
memory usage: 6.8+ MB

The street address dataset consists of 14 columns and 63,721 rows. It contains the location of every property address within the City of Melbourne, including latitude, longitude, northing, and easting.
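Since the dataset carries both geographic (latitude/longitude) and projected (easting/northing) coordinates, the latitude/longitude pair can be turned into point geometries for the geospatial analysis later. A minimal sketch with geopandas (already imported above), using two hypothetical rows standing in for the full dataset:

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

# Two hypothetical rows standing in for the street address data
sample = pd.DataFrame({
    'latitude': [-37.802382, -37.816860],
    'longitude': [144.941473, 144.969991],
})

# Build point geometries from the lat/lon columns; WGS84 (EPSG:4326)
# is the usual CRS for raw latitude/longitude pairs.
geometry = [Point(xy) for xy in zip(sample['longitude'], sample['latitude'])]
sample_gdf = gpd.GeoDataFrame(sample, geometry=geometry, crs='EPSG:4326')
print(sample_gdf.crs)
```

Note that shapely points take (x, y), i.e. (longitude, latitude); swapping the order is a common source of maps that render in the wrong hemisphere.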

In [9]:
# Business establishments and jobs data by business size and industry 
industrysize_jobs.head()
Out[9]:
census_year clue_small_area anzsic_indusrty clue_industry business_size total_establishments total_jobs
0 2015 West Melbourne (Residential) Health Care and Social Assistance Health Care and Social Assistance Large business 1 NaN
1 2015 West Melbourne (Residential) Manufacturing Manufacturing Medium business 5 171.0
2 2015 West Melbourne (Residential) Manufacturing Manufacturing Non employing 1 NaN
3 2015 West Melbourne (Residential) Professional, Scientific and Technical Services Business Services Non employing 3 0.0
4 2015 West Melbourne (Residential) Rental, Hiring and Real Estate Services Real Estate Services Small business 5 42.0
In [10]:
# Business establishments and jobs data by business size and industry 
#datatype information

industrysize_jobs.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14692 entries, 0 to 14691
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   census_year           14692 non-null  int64  
 1   clue_small_area       14692 non-null  object 
 2   anzsic_indusrty       14692 non-null  object 
 3   clue_industry         14692 non-null  object 
 4   business_size         14692 non-null  object 
 5   total_establishments  14692 non-null  int64  
 6   total_jobs            10365 non-null  float64
dtypes: float64(1), int64(2), object(4)
memory usage: 803.6+ KB

The industrysize_jobs dataset consists of 7 columns with a total of 14,692 rows. This dataset provides information on different industries classified according to ANZSIC, CLUE industry, and CLUE small area classification, as well as the total job creation for each business size.

In [11]:
# Business establishments location and industry classification
industryinfo_classification.head()
Out[11]:
census_year block_id property_id base_property_id clue_small_area trading_name business_address industry_anzsic4_code industry_anzsic4_description longitude latitude
0 2017 266 109851 109851 Carlton Metropoli's Research Pty Ltd Level 1, 74 Victoria Street CARLTON 3053 6950 Market Research and Statistical Services 144.965352 -37.806701
1 2017 266 109851 109851 Carlton J Hong Restaurant Ground , 74 Victoria Street CARLTON 3053 4511 Cafes and Restaurants 144.965352 -37.806701
2 2017 266 534003 534003 Carlton St2 Expresso 70 Victoria Street CARLTON 3053 4512 Takeaway Food Services 144.965473 -37.806714
3 2017 266 664003 664003 Carlton RMIT Resources Ltd 20 Cardigan Street CARLTON 3053 8102 Higher Education 144.964753 -37.806312
4 2017 266 664005 664005 Carlton vacant 24 Cardigan Street CARLTON 3053 0 Vacant Space 144.964772 -37.806203
In [12]:
# Business establishments location and industry classification datatype information
industryinfo_classification.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374210 entries, 0 to 374209
Data columns (total 11 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   census_year                   374210 non-null  int64  
 1   block_id                      374210 non-null  int64  
 2   property_id                   374210 non-null  int64  
 3   base_property_id              374210 non-null  int64  
 4   clue_small_area               374210 non-null  object 
 5   trading_name                  374083 non-null  object 
 6   business_address              374209 non-null  object 
 7   industry_anzsic4_code         374210 non-null  int64  
 8   industry_anzsic4_description  374210 non-null  object 
 9   longitude                     369425 non-null  float64
 10  latitude                      369425 non-null  float64
dtypes: float64(2), int64(5), object(4)
memory usage: 31.4+ MB

The industryinfo_classification dataset has 374,210 rows with 11 columns.

Data Preprocessing

Data preprocessing is an essential step to ensure the datasets are ready for further analysis. In this step, we begin by identifying and addressing any missing values so that our data is complete and reliable. Next, we detect and remove any duplicate entries that could skew the analysis and lead to inaccurate results. By cleaning the data, we enhance the quality and integrity of the datasets, making them suitable for detailed exploration, analysis, and model building in subsequent steps.

Street_add

Preprocessing the street addresses dataset.

In [13]:
# Detecting missing values
missing_values = street_add.isnull().sum()

print(missing_values)
geo_point_2d        0
geo_shape           0
suburb_id           6
latitude            0
street_no         652
str_name            1
address_pnt         0
easting             0
northing            0
gisid               0
longitude           0
suburb            730
street_id           0
add_comp        62369
dtype: int64

In the street address dataset, the feature 'add_comp' has a total of 62,369 missing values, which is more than half of the entries.

In [14]:
# Removing the 'add_comp' column

street_add1 = street_add.drop(columns=['add_comp'])

street_add1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63721 entries, 0 to 63720
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   geo_point_2d  63721 non-null  object 
 1   geo_shape     63721 non-null  object 
 2   suburb_id     63715 non-null  float64
 3   latitude      63721 non-null  float64
 4   street_no     63069 non-null  object 
 5   str_name      63720 non-null  object 
 6   address_pnt   63721 non-null  object 
 7   easting       63721 non-null  float64
 8   northing      63721 non-null  float64
 9   gisid         63721 non-null  int64  
 10  longitude     63721 non-null  float64
 11  suburb        62991 non-null  object 
 12  street_id     63721 non-null  int64  
dtypes: float64(5), int64(2), object(6)
memory usage: 6.3+ MB
In [15]:
#Dropping rows with missing values in the 'street_add1' dataset. 
street_add_cleaned = street_add1.dropna()
In [16]:
# Confirm that no missing values remain
street_add_cleaned1 = street_add_cleaned.isnull().sum()
print(street_add_cleaned1)
geo_point_2d    0
geo_shape       0
suburb_id       0
latitude        0
street_no       0
str_name        0
address_pnt     0
easting         0
northing        0
gisid           0
longitude       0
suburb          0
street_id       0
dtype: int64
In [17]:
# Check for duplicates and show the number of duplicate rows
duplicate_count_before = street_add_cleaned.duplicated().sum()
print(f"Number of duplicate rows before removal: {duplicate_count_before}")
Number of duplicate rows before removal: 0
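Dropping 'add_comp' by name works here, but the same decision can be generalised: drop any column whose missing fraction exceeds a cutoff. A sketch (the 50% threshold is our choice, not a rule from the notebook):

```python
import pandas as pd
import numpy as np

def drop_sparse_columns(df: pd.DataFrame, max_missing_frac: float = 0.5) -> pd.DataFrame:
    """Drop columns whose fraction of missing values exceeds max_missing_frac."""
    missing_frac = df.isnull().mean()          # per-column fraction of NaNs
    keep = missing_frac[missing_frac <= max_missing_frac].index
    return df[keep]

# Toy frame: 'mostly_nan' is 75% missing and gets dropped
toy = pd.DataFrame({
    'full': [1, 2, 3, 4],
    'mostly_nan': [np.nan, np.nan, np.nan, 4.0],
})
print(drop_sparse_columns(toy).columns.tolist())  # ['full']
```

Applied to street_add with the default threshold, this would drop exactly 'add_comp' (62,369 of 63,721 values missing) and keep the other 13 columns.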

Business establishments and jobs data by business size and industry

In [18]:
industrysize_jobs.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14692 entries, 0 to 14691
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   census_year           14692 non-null  int64  
 1   clue_small_area       14692 non-null  object 
 2   anzsic_indusrty       14692 non-null  object 
 3   clue_industry         14692 non-null  object 
 4   business_size         14692 non-null  object 
 5   total_establishments  14692 non-null  int64  
 6   total_jobs            10365 non-null  float64
dtypes: float64(1), int64(2), object(4)
memory usage: 803.6+ KB
In [19]:
# Detecting missing values in the dataset
missing_values = industrysize_jobs.isnull().sum()
print(missing_values)
census_year                0
clue_small_area            0
anzsic_indusrty            0
clue_industry              0
business_size              0
total_establishments       0
total_jobs              4327
dtype: int64
In [20]:
# Calculating the composition of the dataset based on business size
business_size_industryjobs = industrysize_jobs['business_size'].value_counts()

print(business_size_industryjobs)
Small business     5201
Medium business    4392
Non employing      2931
Large business     2168
Name: business_size, dtype: int64
In [21]:
# Counting the missing values in 'total_jobs', grouped by business size
missing_values_jobs_business_size = industrysize_jobs[industrysize_jobs['total_jobs'].isnull()].groupby('business_size')['total_jobs'].size()

# Displaying the missing-value counts for each business size
print(missing_values_jobs_business_size)
business_size
Large business     1092
Medium business    1297
Non employing      1211
Small business      727
Name: total_jobs, dtype: int64
In [22]:
# Grouping data by year and business size
grouped_data = industrysize_jobs.groupby(['census_year', 'business_size'])['total_jobs'].sum().unstack()

# Plotting
plt.figure(figsize=(10, 6))
for column in grouped_data.columns:
    plt.plot(grouped_data.index, grouped_data[column], marker='o', label=column)

plt.title('Total Number of Jobs Over the Years by Business Size')
plt.xlabel('Year')
plt.ylabel('Total Jobs')
plt.legend(title='Business Size')
plt.grid(True)
plt.show()
In [23]:
# Descriptive statistic by grouping different business size
industrysize_jobs_desc = industrysize_jobs.groupby('business_size')['total_jobs'].describe()

industrysize_jobs_desc
Out[23]:
count mean std min 25% 50% 75% max
business_size
Large business 1076.0 12633.487918 32292.079142 687.0 2084.5 3882.0 11539.25 273200.0
Medium business 3095.0 2418.815509 10420.953933 66.0 222.5 409.0 1267.00 153302.0
Non employing 1720.0 0.000000 0.000000 0.0 0.0 0.0 0.00 0.0
Small business 4474.0 984.121144 5117.223702 3.0 48.0 108.0 393.50 80008.0
In [24]:
# Replacing missing value with min value based on business size
industrysize_jobs['total_jobs'] = industrysize_jobs.groupby('business_size')['total_jobs'].transform(lambda x: x.fillna(x.min()))
In [25]:
# Grouping data by year and business size
grouped_data = industrysize_jobs.groupby(['census_year', 'business_size'])['total_jobs'].sum().unstack()

# Plotting
plt.figure(figsize=(10, 6))
for column in grouped_data.columns:
    plt.plot(grouped_data.index, grouped_data[column], marker='o', label=column)

plt.title('Total Number of Jobs Over the Years by Business Size')
plt.xlabel('Year')
plt.ylabel('Total Jobs')
plt.legend(title='Business Size')
plt.grid(True)
plt.show()

The industrysize_jobs dataset contains 4,327 missing entries in the total_jobs column, which records the total number of jobs across the various business sizes. To address these missing values, we imputed each of them with the minimum total_jobs observed for the corresponding business size.

Using the per-group minimum for imputation is a practical, conservative choice: it avoids overestimating the employment contribution of any business-size category (for example, 'Non employing' rows are filled with 0).
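The groupby-transform fill used above can be seen on a toy frame (column names mirror the real dataset, values invented), with a sanity check that no gaps remain afterwards:

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    'business_size': ['Small business', 'Small business',
                      'Large business', 'Large business'],
    'total_jobs': [3.0, np.nan, 687.0, np.nan],
})

# Same pattern as the notebook: fill each group's NaNs with that group's minimum
toy['total_jobs'] = toy.groupby('business_size')['total_jobs'].transform(
    lambda x: x.fillna(x.min()))

print(toy['total_jobs'].tolist())  # [3.0, 3.0, 687.0, 687.0]
assert toy['total_jobs'].isnull().sum() == 0
```

`transform` returns a series aligned with the original index, which is what lets the group-wise fill be assigned straight back onto the column.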

In [26]:
# Check for duplicates and show the number of duplicate rows
duplicate_count_before = industrysize_jobs.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count_before}")
Number of duplicate rows: 0

Business establishments location and industry classification

In [27]:
industryinfo_classification.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374210 entries, 0 to 374209
Data columns (total 11 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   census_year                   374210 non-null  int64  
 1   block_id                      374210 non-null  int64  
 2   property_id                   374210 non-null  int64  
 3   base_property_id              374210 non-null  int64  
 4   clue_small_area               374210 non-null  object 
 5   trading_name                  374083 non-null  object 
 6   business_address              374209 non-null  object 
 7   industry_anzsic4_code         374210 non-null  int64  
 8   industry_anzsic4_description  374210 non-null  object 
 9   longitude                     369425 non-null  float64
 10  latitude                      369425 non-null  float64
dtypes: float64(2), int64(5), object(4)
memory usage: 31.4+ MB
In [28]:
# Detect missing values
missing_values = industryinfo_classification.isnull().sum()
print(missing_values)
census_year                        0
block_id                           0
property_id                        0
base_property_id                   0
clue_small_area                    0
trading_name                     127
business_address                   1
industry_anzsic4_code              0
industry_anzsic4_description       0
longitude                       4785
latitude                        4785
dtype: int64
In [29]:
# Remove all rows which have missing values
industryinfo_classified=industryinfo_classification.dropna()
In [30]:
industryinfo_classified.isnull().sum()
Out[30]:
census_year                     0
block_id                        0
property_id                     0
base_property_id                0
clue_small_area                 0
trading_name                    0
business_address                0
industry_anzsic4_code           0
industry_anzsic4_description    0
longitude                       0
latitude                        0
dtype: int64
In [31]:
# Check for duplicates and show the number of duplicate rows
duplicate_count_before = industryinfo_classified.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count_before}")
Number of duplicate rows: 0

We have successfully cleaned the dataset by removing rows with missing values and confirming there are no duplicates.

Exploratory Data Analysis¶

We will conduct an exploratory analysis of business establishments and jobs data, focusing on how businesses are distributed across different industries and sizes, and how these factors influence total employment. By leveraging various visualizations and trend analyses, we aim to uncover key patterns and insights that will provide a deeper understanding of the economic landscape and inform strategic decision-making.

First, we explore the "Business establishments and jobs data by business size and industry" dataset to surface patterns that are not apparent from the raw tables.

In [32]:
# Create a pivot table
pivot_table = industrysize_jobs.pivot_table(
    values='total_jobs',
    index='anzsic_indusrty',
    columns='business_size',
    aggfunc='sum'
)

# Create a heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(pivot_table, annot=True, fmt='.0f', cmap='YlGnBu')
plt.title('Heatmap of Total Jobs by Industry and Business Size')
plt.xlabel('Business Size')
plt.ylabel('Industry')
plt.show()

The heatmap reveals that large businesses are the primary contributors to job creation across most industries, particularly in "Financial and Insurance Services" and "Professional, Scientific and Technical Services." Medium and small businesses also play a vital role, especially in sectors like "Construction" and "Retail Trade." Non-employing businesses, though less impactful in job numbers, are present in all industries, with notable representation in "Retail Trade" and "Construction." This analysis underscores the significant role of large enterprises in employment, while highlighting the diversity and importance of smaller business sizes across various sectors.

In [33]:
# Create a stacked bar chart
plt.figure(figsize=(16, 8))
business_area_industry = industrysize_jobs.groupby(['clue_small_area', 'anzsic_indusrty']).agg({
    'total_establishments': 'sum'
}).unstack().fillna(0)

business_area_industry.plot(kind='bar', stacked=True, ax=plt.gca())
plt.title('Total Establishments by Area and Industry')
plt.xlabel('Clue Small Area')
plt.ylabel('Total Establishments')
plt.xticks(rotation=45)
plt.legend(title='Industry', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

The bar chart illustrates the distribution of total business establishments across various Clue Small Areas within the City of Melbourne. The "City of Melbourne (total)" area, unsurprisingly, has the highest concentration of establishments, with a significant dominance in the "All ANZSIC" category, followed by "Melbourne (CBD)" which shows a diverse representation across multiple industries. Other areas like Docklands, South Yarra, and Port Melbourne have a much smaller, yet varied, distribution of establishments across different industries. This visualization highlights the central role of Melbourne's CBD in hosting a large number of businesses, reflecting its importance as a key economic hub in the region.

In [34]:
# Create a pairplot colored by business size (pairplot manages its own figure,
# so a preceding plt.figure() call would only create an empty extra figure)
sns.pairplot(industrysize_jobs, hue='business_size', vars=['total_establishments', 'total_jobs'])

# Set an overall title above the grid
plt.suptitle('Pair Plot of Total Establishments and Total Jobs by Business Size', y=1.02, fontsize=16)

plt.show()

The pair plot visualizes the relationships between total establishments and total jobs across different business sizes. It shows that large businesses generally have a higher number of jobs but fewer establishments compared to small and medium-sized businesses. Small businesses, represented by red dots, are more concentrated in both the number of establishments and jobs, indicating a larger presence but fewer employees per establishment. Medium businesses occupy a middle ground, while non-employing businesses (green) are clustered close to the origin, reflecting minimal job creation despite a number of establishments. This plot helps to identify the scale and impact of different business sizes within the dataset.

In [35]:
# Convert the 'census_year' column to datetime format, if it's not already
industrysize_jobs['census_year'] = pd.to_datetime(industrysize_jobs['census_year'], errors='coerce', format='%Y')

# Group data by year and industry, then calculate the sum of total establishments and total jobs
industry_yearly_data = industrysize_jobs.groupby([industrysize_jobs['census_year'].dt.year, 'anzsic_indusrty']).agg({
    'total_establishments': 'sum',
    'total_jobs': 'sum'
}).reset_index()

# Create a line plot for total establishments over time, by industry
fig_establishments = px.line(
    industry_yearly_data,
    x='census_year',
    y='total_establishments',
    color='anzsic_indusrty',
    title='Yearly Trends in Total Establishments by Industry',
    labels={'census_year': 'Year', 'total_establishments': 'Total Establishments'},
    markers=True
)

fig_establishments.update_layout(
    xaxis_title='Year',
    yaxis_title='Total Establishments',
    legend_title_text='Industry',
    width=900,
    height=600
)

fig_establishments.show()

# Create a line plot for total jobs over time, by industry
fig_jobs = px.line(
    industry_yearly_data,
    x='census_year',
    y='total_jobs',
    color='anzsic_indusrty',
    title='Yearly Trends in Total Jobs by Industry',
    labels={'census_year': 'Year', 'total_jobs': 'Total Jobs'},
    markers=True
)

fig_jobs.update_layout(
    xaxis_title='Year',
    yaxis_title='Total Jobs',
    legend_title_text='Industry',
    width=900,
    height=600
)

fig_jobs.show()

The first chart shows the yearly trends in total establishments by industry in the City of Melbourne from 2005 to 2020, covering every ANZSIC division, from Accommodation and Food Services through Wholesale Trade, together with the aggregate "All ANZSIC" series.

The second visualization shows the yearly trends in total jobs by industry as a line chart, tracing the changes in job numbers across industries from 2005 to 2020. Each industry is drawn in a different colour, making the relative growth or decline of each one easy to follow over time. The chart highlights industries that have experienced significant change, such as Accommodation and Food Services, Administrative and Support Services, and Manufacturing. This information is valuable for understanding the economic landscape, identifying emerging industries, and informing workforce development and business strategy.

Next, the analysis turns to the "Business establishments location and industry classification" dataset.

A word cloud of trading names highlights the most frequent words in business names, giving a quick sense of naming patterns across the city.

In [36]:
# Word Cloud of Business Names
wordcloud = WordCloud(width=1600, height=800, background_color='white').generate(' '.join(industryinfo_classified['trading_name']))
plt.figure(figsize=(20,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Business Names')
plt.show()

Next, a correlation heatmap of the numeric variables. The heatmap displays the correlation coefficients between variables such as census_year, block_id, property_id, base_property_id, industry_anzsic4_code, longitude, and latitude. With the coolwarm colormap, the scale ranges from -1 (strong negative correlation, shown in dark blue) to 1 (strong positive correlation, shown in dark red), with near-white indicating little or no correlation.

Some key observations from the heatmap:

  1. Every variable correlates perfectly (1.0) with itself, as expected along the diagonal.
  2. The variables "block_id" and "property_id" are essentially uncorrelated (0.014).
  3. The variables "base_property_id" and "block_id" have a moderate positive correlation (0.56).
  4. The variable "industry_anzsic4_code" is weakly negatively correlated with "census_year" (-0.085) and moderately negatively correlated with "longitude" (-0.4).
  5. The variables "longitude" and "latitude" have a weak negative correlation (-0.21).

Overall, the heatmap provides a visual representation of the relationships between the different numeric variables in the dataset.

In [37]:
# Correlation Heatmap
numeric_df = industryinfo_classified.select_dtypes(include=[np.number])
plt.figure(figsize=(12, 10))
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap of Numeric Variables')
plt.show()

A business density heatmap by CLUE (Census of Land Use and Employment) Small Area and census year is presented here for locations across the City of Melbourne. It displays the number of business establishments in each CLUE Small Area over time, from 2002 to 2022.

The key observations from the heatmap are:

  1. The area with the highest business density is Melbourne (CBD), which has over 10,000 businesses throughout the time period.
  2. Other areas with high business density include Docklands, East Melbourne, and Southbank.
  3. Suburban areas like Kensington, Parkville, and South Yarra have relatively lower business densities compared to the central and inner-city areas.
  4. Some areas, such as West Melbourne (Residential) and West Melbourne (Industrial), have relatively stable business densities over the years.
  5. The heatmap provides a detailed view of the changes in business density across different CLUE Small Areas and over the census years.

This information can be useful for understanding the spatial distribution and trends of business activities in the Melbourne metropolitan area.

In [38]:
# Business Density Heatmap by CLUE Small Area
pivot = industryinfo_classified.pivot_table(
    index='clue_small_area',
    columns='census_year',
    values='property_id',
    aggfunc='count'
)
plt.figure(figsize=(15, 10))
sns.heatmap(pivot, cmap='YlOrRd', annot=True, fmt='g')
plt.title('Business Density Heatmap by CLUE Small Area and Census Year')
plt.xlabel('Census Year')
plt.ylabel('CLUE Small Area')
plt.show()

The industry composition of the top 10 CLUE Small Areas in Melbourne is shown here. The vertical axis represents the total number of businesses, and the horizontal axis lists the CLUE Small Areas.

The key observations from the chart are:

  1. The Melbourne (CBD) area has the highest total number of businesses, with a significant portion in the "Vacant Space" industry.
  2. Other areas like Carlton, Docklands, and North Melbourne also have a large presence of "Vacant Space" businesses.
  3. Cafes and Restaurants are a prominent industry in many of the CLUE Small Areas, particularly in areas like Melbourne (CBD), Carlton, and Southbank.
  4. Legal Services and Takeaway Food Services are also noticeable industries across several CLUE Small Areas.
  5. Computer System Design and Related Services is an industry with a relatively smaller presence compared to the others shown in the chart.

This information provides insights into the industry composition and business landscape within the top 10 CLUE Small Areas in Melbourne, which can be useful for understanding the economic activity and trends in the city.

In [39]:
# Industry Composition by CLUE Small Area
top_areas = industryinfo_classified['clue_small_area'].value_counts().nlargest(10).index
top_industries = industryinfo_classified['industry_anzsic4_description'].value_counts().nlargest(5).index

industry_area = pd.crosstab(
    industryinfo_classified['clue_small_area'],
    industryinfo_classified['industry_anzsic4_description']
)
industry_area = industry_area.loc[top_areas, top_industries]

industry_area.plot(kind='bar', stacked=True, figsize=(15, 10))
plt.title('Industry Composition in Top 10 CLUE Small Areas')
plt.xlabel('CLUE Small Area')
plt.ylabel('Number of Businesses')
plt.legend(title='Industry', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

The graph shows the geographical distribution of businesses by industry. Each dot on the map represents a business, colour-coded by industry type. The map reveals the clustering and dispersion of different types of businesses across the region: some industries concentrate in particular areas, while others are more spread out. This is useful for understanding the economic landscape and patterns of commercial activity in the region.

In [40]:
# Create a scatter plot with latitude, longitude, and industry
fig = px.scatter(
    industryinfo_classified,
    x='longitude',
    y='latitude',
    color='industry_anzsic4_description',  # Color by industry description
    size='block_id',  # marker size scaled by block_id (an identifier, so sizes are arbitrary)
    hover_name='trading_name',  # Show trading name on hover
    title='Geographical Distribution of Businesses by Industry',
    labels={'longitude': 'Longitude', 'latitude': 'Latitude'}
)

fig.show()

A treemap visualization of businesses by area and industry in Melbourne, Australia is shown here. The different areas of Melbourne (CBD, Carlton, Docklands, East Melbourne, West Melbourne, Port Melbourne, Melbourne Remainder, Southbank, North Melbourne, Kensington, Parkville, and South Yarra) are represented as the larger rectangles, and the industries within each area are shown as smaller nested rectangles.

The size of each rectangle corresponds to the relative number of businesses in that industry and area. The industries shown include Vacant Space, Legal Services, Takeaway Food Services, Cafes and Restaurants, and others.

Some key observations:

  1. The Melbourne (CBD) area has the highest overall business density, with a large presence in Vacant Space, Legal Services, and Cafes and Restaurants.
  2. Docklands and North Melbourne also have significant business activity, particularly in Vacant Space and Cafes and Restaurants.
  3. Suburban areas like Kensington and Parkville have a smaller overall business presence compared to the central and inner-city regions.
  4. The industry composition varies across the different areas, with some areas specializing in certain industries more than others.

This treemap visualization provides a comprehensive overview of the spatial distribution and industry composition of businesses across the Melbourne metropolitan area, which can be useful for understanding the economic landscape and identifying potential opportunities or trends.

In [41]:
import plotly.express as px

# Prepare the data for treemap
treemap_data = industryinfo_classified.groupby(['clue_small_area', 'industry_anzsic4_description']).size().reset_index(name='counts')

# Create the treemap
fig = px.treemap(treemap_data,
                 path=['clue_small_area', 'industry_anzsic4_description'],
                 values='counts',
                 title='Treemap of Businesses by Area and Industry')

fig.show()

The distribution of easting (the east-west geographic coordinate) across suburbs in the Melbourne area is shown here. Each box summarizes one suburb, and the vertical axis shows its easting values.

Some key observations:

  1. The suburbs with the highest easting values are located in the eastern and central parts of the Melbourne metropolitan area, such as North Melbourne, Melbourne, and Southbank.
  2. The suburbs with the lowest easting values are in the western and outer suburbs, like Port Melbourne, Williamstown, and South Yarra.
  3. There is a wide range of easting values across the different suburbs, indicating the geographical spread of the suburbs in the east-west direction.
  4. The box plots provide a summary of the distribution of easting values for each suburb, showing the median, interquartile range, and any outliers.
  5. The variation in easting values reflects the spatial distribution and layout of the suburbs within the Melbourne region.

This visualization is useful for understanding the geographic positioning of the different suburbs in relation to the east-west axis, which may be relevant for urban planning, transportation, or other spatial analyses involving the Melbourne metropolitan area.

In [42]:
# Box plot of easting across suburbs
plt.figure(figsize=(14, 6))
sns.boxplot(data=street_add_cleaned, x='suburb', y='easting')
plt.xticks(rotation=90)
plt.title('Distribution of Easting Across Suburbs')
plt.xlabel('Suburb')
plt.ylabel('Easting')
plt.show()

# Box plot of northing across suburbs
plt.figure(figsize=(14, 6))
sns.boxplot(data=street_add_cleaned, x='suburb', y='northing')
plt.xticks(rotation=90)
plt.title('Distribution of Northing Across Suburbs')
plt.xlabel('Suburb')
plt.ylabel('Northing')
plt.show()

Data Modelling¶

For data modeling, we plan to implement both classification and regression analysis to derive actionable insights from our dataset. The classification model will predict the business size (business_size) as our target class, utilizing features such as census_year, longitude, latitude, clue_small_area, industry_anzsic4_code, and total_establishments. This analysis will help in understanding the factors that determine whether a business is classified as small, medium, or large.

Simultaneously, the regression model will focus on predicting the total number of jobs (total_jobs) as the target variable, leveraging the same set of features. This regression analysis will allow us to forecast employment trends based on business characteristics and location data. Through careful selection of features and model tuning, we aim to uncover patterns that can inform economic and urban planning decisions, with the classification model providing insights into business categorization and the regression model offering projections on job creation.
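As a minimal sketch of this setup, assuming toy data with the notebook's column names, the shared feature matrix and the two targets can be prepared as below (using pandas `factorize` to encode the categorical area feature; the actual classification and regression models would then be fitted on `X` against `y_class` and `y_reg` respectively):

```python
import pandas as pd

# Toy stand-in for the merged dataset (column names follow the notebook; values are illustrative)
df = pd.DataFrame({
    'census_year': [2017, 2018, 2019, 2020],
    'longitude': [144.96, 144.97, 144.95, 144.96],
    'latitude': [-37.81, -37.80, -37.82, -37.81],
    'clue_small_area': ['Carlton', 'Docklands', 'Carlton', 'Southbank'],
    'industry_anzsic4_code': [4511, 6950, 4512, 8102],
    'total_establishments': [7, 3, 9, 2],
    'business_size': ['Small business', 'Large business', 'Small business', 'Medium business'],
    'total_jobs': [12.0, 450.0, 8.0, 60.0],
})

# Encode the categorical area feature as integer codes
df['clue_small_area_enc'] = pd.factorize(df['clue_small_area'])[0]

features = ['census_year', 'longitude', 'latitude',
            'clue_small_area_enc', 'industry_anzsic4_code', 'total_establishments']

X = df[features]
y_class = df['business_size']   # target for the classification model
y_reg = df['total_jobs']        # target for the regression model

print(X.shape, y_class.nunique())  # (4, 6) 3
```

Both models share the same feature matrix, so the data only needs to be prepared once.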

Now we merge the data for the modeling phase.

In [43]:
street_add_cleaned.info()
industrysize_jobs.info()
industryinfo_classified.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 62338 entries, 0 to 63720
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   geo_point_2d  62338 non-null  object 
 1   geo_shape     62338 non-null  object 
 2   suburb_id     62338 non-null  float64
 3   latitude      62338 non-null  float64
 4   street_no     62338 non-null  object 
 5   str_name      62338 non-null  object 
 6   address_pnt   62338 non-null  object 
 7   easting       62338 non-null  float64
 8   northing      62338 non-null  float64
 9   gisid         62338 non-null  int64  
 10  longitude     62338 non-null  float64
 11  suburb        62338 non-null  object 
 12  street_id     62338 non-null  int64  
dtypes: float64(5), int64(2), object(6)
memory usage: 6.7+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14692 entries, 0 to 14691
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   census_year           14692 non-null  datetime64[ns]
 1   clue_small_area       14692 non-null  object        
 2   anzsic_indusrty       14692 non-null  object        
 3   clue_industry         14692 non-null  object        
 4   business_size         14692 non-null  object        
 5   total_establishments  14692 non-null  int64         
 6   total_jobs            14692 non-null  float64       
dtypes: datetime64[ns](1), float64(1), int64(1), object(4)
memory usage: 803.6+ KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 369329 entries, 0 to 374209
Data columns (total 11 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   census_year                   369329 non-null  int64  
 1   block_id                      369329 non-null  int64  
 2   property_id                   369329 non-null  int64  
 3   base_property_id              369329 non-null  int64  
 4   clue_small_area               369329 non-null  object 
 5   trading_name                  369329 non-null  object 
 6   business_address              369329 non-null  object 
 7   industry_anzsic4_code         369329 non-null  int64  
 8   industry_anzsic4_description  369329 non-null  object 
 9   longitude                     369329 non-null  float64
 10  latitude                      369329 non-null  float64
dtypes: float64(2), int64(5), object(4)
memory usage: 33.8+ MB

Merging the datasets

In [44]:
# Reduce industrysize_jobs to one row per clue_small_area (keeping the first record)
industrysize_jobs_unique = industrysize_jobs.groupby('clue_small_area').first().reset_index()


# Merge the DataFrames
merged_df5 = pd.merge(industryinfo_classified,
                     industrysize_jobs_unique[['clue_small_area', 'anzsic_indusrty', 'clue_industry', 'business_size', 'total_establishments', 'total_jobs']],
                     on='clue_small_area',
                     how='left')

# Check the result
print(f"Number of rows in merged DataFrame: {len(merged_df5)}")
print(merged_df5.info())
Number of rows in merged DataFrame: 369329
<class 'pandas.core.frame.DataFrame'>
Int64Index: 369329 entries, 0 to 369328
Data columns (total 16 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   census_year                   369329 non-null  int64  
 1   block_id                      369329 non-null  int64  
 2   property_id                   369329 non-null  int64  
 3   base_property_id              369329 non-null  int64  
 4   clue_small_area               369329 non-null  object 
 5   trading_name                  369329 non-null  object 
 6   business_address              369329 non-null  object 
 7   industry_anzsic4_code         369329 non-null  int64  
 8   industry_anzsic4_description  369329 non-null  object 
 9   longitude                     369329 non-null  float64
 10  latitude                      369329 non-null  float64
 11  anzsic_indusrty               369329 non-null  object 
 12  clue_industry                 369329 non-null  object 
 13  business_size                 369329 non-null  object 
 14  total_establishments          369329 non-null  int64  
 15  total_jobs                    369329 non-null  float64
dtypes: float64(3), int64(6), object(7)
memory usage: 47.9+ MB
None
In [45]:
# Detect missing values
missing_values = merged_df5.isnull().sum()
print(missing_values)

# Check for duplicates and show the number of duplicate rows
duplicate_count_before = merged_df5.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count_before}")
census_year                     0
block_id                        0
property_id                     0
base_property_id                0
clue_small_area                 0
trading_name                    0
business_address                0
industry_anzsic4_code           0
industry_anzsic4_description    0
longitude                       0
latitude                        0
anzsic_indusrty                 0
clue_industry                   0
business_size                   0
total_establishments            0
total_jobs                      0
dtype: int64
Number of duplicate rows: 0
In [46]:
# Merge the combined data with the street addresses on latitude and longitude
merged_df2 = pd.merge(merged_df5, street_add_cleaned,
                     on=['latitude', 'longitude'],
                     how='left')

merged_df2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 369329 entries, 0 to 369328
Data columns (total 27 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   census_year                   369329 non-null  int64  
 1   block_id                      369329 non-null  int64  
 2   property_id                   369329 non-null  int64  
 3   base_property_id              369329 non-null  int64  
 4   clue_small_area               369329 non-null  object 
 5   trading_name                  369329 non-null  object 
 6   business_address              369329 non-null  object 
 7   industry_anzsic4_code         369329 non-null  int64  
 8   industry_anzsic4_description  369329 non-null  object 
 9   longitude                     369329 non-null  float64
 10  latitude                      369329 non-null  float64
 11  anzsic_indusrty               369329 non-null  object 
 12  clue_industry                 369329 non-null  object 
 13  business_size                 369329 non-null  object 
 14  total_establishments          369329 non-null  int64  
 15  total_jobs                    369329 non-null  float64
 16  geo_point_2d                  0 non-null       object 
 17  geo_shape                     0 non-null       object 
 18  suburb_id                     0 non-null       float64
 19  street_no                     0 non-null       object 
 20  str_name                      0 non-null       object 
 21  address_pnt                   0 non-null       object 
 22  easting                       0 non-null       float64
 23  northing                      0 non-null       float64
 24  gisid                         0 non-null       float64
 25  suburb                        0 non-null       object 
 26  street_id                     0 non-null       float64
dtypes: float64(8), int64(6), object(13)
memory usage: 78.9+ MB
In [47]:
# Detect missing values
missing_values = merged_df2.isnull().sum()
print(missing_values)

# Check for duplicates and show the number of duplicate rows
duplicate_count_before = merged_df2.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count_before}")

# List of columns to remove
columns_to_remove = [
    'geo_point_2d',
    'geo_shape',
    'suburb_id',
    'street_no',
    'str_name',
    'address_pnt',
    'easting',
    'northing',
    'gisid',
    'suburb',
    'street_id'
]

# Remove the specified columns
cleaned_df = merged_df2.drop(columns=columns_to_remove, errors='ignore')

print(cleaned_df.info())
census_year                          0
block_id                             0
property_id                          0
base_property_id                     0
clue_small_area                      0
trading_name                         0
business_address                     0
industry_anzsic4_code                0
industry_anzsic4_description         0
longitude                            0
latitude                             0
anzsic_indusrty                      0
clue_industry                        0
business_size                        0
total_establishments                 0
total_jobs                           0
geo_point_2d                    369329
geo_shape                       369329
suburb_id                       369329
street_no                       369329
str_name                        369329
address_pnt                     369329
easting                         369329
northing                        369329
gisid                           369329
suburb                          369329
street_id                       369329
dtype: int64
Number of duplicate rows: 0
<class 'pandas.core.frame.DataFrame'>
Int64Index: 369329 entries, 0 to 369328
Data columns (total 16 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   census_year                   369329 non-null  int64  
 1   block_id                      369329 non-null  int64  
 2   property_id                   369329 non-null  int64  
 3   base_property_id              369329 non-null  int64  
 4   clue_small_area               369329 non-null  object 
 5   trading_name                  369329 non-null  object 
 6   business_address              369329 non-null  object 
 7   industry_anzsic4_code         369329 non-null  int64  
 8   industry_anzsic4_description  369329 non-null  object 
 9   longitude                     369329 non-null  float64
 10  latitude                      369329 non-null  float64
 11  anzsic_indusrty               369329 non-null  object 
 12  clue_industry                 369329 non-null  object 
 13  business_size                 369329 non-null  object 
 14  total_establishments          369329 non-null  int64  
 15  total_jobs                    369329 non-null  float64
dtypes: float64(3), int64(6), object(7)
memory usage: 47.9+ MB
None
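The join on exact latitude/longitude above returned no matches (every street-address column came back entirely null), which is why those columns were dropped. Independently geocoded coordinates rarely agree to full float precision; one forgiving alternative, sketched here on toy values, is to round both sides to a shared precision before merging:

```python
import pandas as pd

# Two toy frames whose coordinates differ only at high precision (illustrative values)
businesses = pd.DataFrame({'latitude': [-37.806701], 'longitude': [144.965352], 'name': ['cafe']})
streets = pd.DataFrame({'latitude': [-37.806696], 'longitude': [144.965361], 'suburb': ['Carlton']})

# Round to 4 decimal places (roughly 11 m) to create a shared join key
for frame in (businesses, streets):
    frame['lat_key'] = frame['latitude'].round(4)
    frame['lon_key'] = frame['longitude'].round(4)

joined = businesses.merge(streets[['lat_key', 'lon_key', 'suburb']],
                          on=['lat_key', 'lon_key'], how='left')
print(joined['suburb'].tolist())  # ['Carlton']
```

The rounding precision trades off match rate against the risk of joining distinct addresses, so it should be chosen with the data's geocoding accuracy in mind.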
In [48]:
cleaned_df.head()
Out[48]:
census_year block_id property_id base_property_id clue_small_area trading_name business_address industry_anzsic4_code industry_anzsic4_description longitude latitude anzsic_indusrty clue_industry business_size total_establishments total_jobs
0 2017 266 109851 109851 Carlton Metropoli's Research Pty Ltd Level 1, 74 Victoria Street CARLTON 3053 6950 Market Research and Statistical Services 144.965352 -37.806701 Administrative and Support Services Admin and Support Services Medium business 7 209.0
1 2017 266 109851 109851 Carlton J Hong Restaurant Ground , 74 Victoria Street CARLTON 3053 4511 Cafes and Restaurants 144.965352 -37.806701 Administrative and Support Services Admin and Support Services Medium business 7 209.0
2 2017 266 534003 534003 Carlton St2 Expresso 70 Victoria Street CARLTON 3053 4512 Takeaway Food Services 144.965473 -37.806714 Administrative and Support Services Admin and Support Services Medium business 7 209.0
3 2017 266 664003 664003 Carlton RMIT Resources Ltd 20 Cardigan Street CARLTON 3053 8102 Higher Education 144.964753 -37.806312 Administrative and Support Services Admin and Support Services Medium business 7 209.0
4 2017 266 664005 664005 Carlton vacant 24 Cardigan Street CARLTON 3053 0 Vacant Space 144.964772 -37.806203 Administrative and Support Services Admin and Support Services Medium business 7 209.0

Classification Model¶

The goal of the classification model is to predict business size (business_size) as the target class, using a robust set of features.

In [50]:
cleaned_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 369329 entries, 0 to 369328
Data columns (total 16 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   census_year                   369329 non-null  int64  
 1   block_id                      369329 non-null  int64  
 2   property_id                   369329 non-null  int64  
 3   base_property_id              369329 non-null  int64  
 4   clue_small_area               369329 non-null  object 
 5   trading_name                  369329 non-null  object 
 6   business_address              369329 non-null  object 
 7   industry_anzsic4_code         369329 non-null  int64  
 8   industry_anzsic4_description  369329 non-null  object 
 9   longitude                     369329 non-null  float64
 10  latitude                      369329 non-null  float64
 11  anzsic_indusrty               369329 non-null  object 
 12  clue_industry                 369329 non-null  object 
 13  business_size                 369329 non-null  object 
 14  total_establishments          369329 non-null  int64  
 15  total_jobs                    369329 non-null  float64
dtypes: float64(3), int64(6), object(7)
memory usage: 47.9+ MB
In [51]:
cleaned_df['business_size'].value_counts() # target class
Out[51]:
Large business     254033
Medium business     83769
Non employing       19451
Small business      12076
Name: business_size, dtype: int64

Mapping categorical variables to numerical variables
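The hand-written dictionaries below work, but as a hedged alternative pandas can derive the same ordinal codes from an ordered Categorical; a minimal sketch using the size labels from this dataset:

```python
import pandas as pd

# Hypothetical mini-frame with the same labels used in this dataset
df = pd.DataFrame({"business_size": ["Medium business", "Large business",
                                     "Non employing", "Small business"]})

# An ordered Categorical yields the same 0-3 codes as the manual mapping
size_order = ["Non employing", "Small business", "Medium business", "Large business"]
df["business_size_numeric"] = pd.Categorical(
    df["business_size"], categories=size_order, ordered=True
).codes

print(df)
```

This avoids silently producing NaN for a label missing from a hand-typed dictionary: an unseen label gets code -1, which is easy to check for.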

In [52]:
# Mapping business_size to numeric values
business_size_mapping = {
    'Large business': 3,
    'Medium business': 2,
    'Small business': 1,
    'Non employing': 0
}

# Apply the mapping to the 'business_size' column
cleaned_df['business_size_numeric'] = cleaned_df['business_size'].map(business_size_mapping)

# Check the conversion
print(cleaned_df[['business_size', 'business_size_numeric']].head())
     business_size  business_size_numeric
0  Medium business                      2
1  Medium business                      2
2  Medium business                      2
3  Medium business                      2
4  Medium business                      2
In [53]:
# Mapping clue_small_area to numeric values
clue_small_area_mapping = {
    'Melbourne (CBD)': 0,
    'Carlton': 1,
    'North Melbourne': 2,
    'Docklands': 3,
    'Southbank': 4,
    'East Melbourne': 5,
    'West Melbourne (Residential)': 6,
    'Port Melbourne': 7,
    'Kensington': 8,
    'Melbourne (Remainder)': 9,
    'West Melbourne (Industrial)': 10,
    'Parkville': 11,
    'South Yarra': 12
}

# Apply the mapping to the 'clue_small_area' column
cleaned_df['clue_small_area_numeric'] = cleaned_df['clue_small_area'].map(clue_small_area_mapping)

# Check the conversion
print(cleaned_df[['clue_small_area', 'clue_small_area_numeric']].head())
  clue_small_area  clue_small_area_numeric
0         Carlton                        1
1         Carlton                        1
2         Carlton                        1
3         Carlton                        1
4         Carlton                        1
In [54]:
# Mapping clue_industry to numeric values
clue_industry_mapping = {
    'Accommodation': 0,
    'Food and Beverage Services': 1,
    'Admin and Support Services': 2,
    'Health Care and Social Assistance': 3
}

# Apply the mapping to the 'clue_industry' column
cleaned_df['clue_industry_numeric'] = cleaned_df['clue_industry'].map(clue_industry_mapping)

# Check the conversion
print(cleaned_df[['clue_industry', 'clue_industry_numeric']].head())
                clue_industry  clue_industry_numeric
0  Admin and Support Services                      2
1  Admin and Support Services                      2
2  Admin and Support Services                      2
3  Admin and Support Services                      2
4  Admin and Support Services                      2
In [55]:
cleaned_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 369329 entries, 0 to 369328
Data columns (total 19 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   census_year                   369329 non-null  int64  
 1   block_id                      369329 non-null  int64  
 2   property_id                   369329 non-null  int64  
 3   base_property_id              369329 non-null  int64  
 4   clue_small_area               369329 non-null  object 
 5   trading_name                  369329 non-null  object 
 6   business_address              369329 non-null  object 
 7   industry_anzsic4_code         369329 non-null  int64  
 8   industry_anzsic4_description  369329 non-null  object 
 9   longitude                     369329 non-null  float64
 10  latitude                      369329 non-null  float64
 11  anzsic_indusrty               369329 non-null  object 
 12  clue_industry                 369329 non-null  object 
 13  business_size                 369329 non-null  object 
 14  total_establishments          369329 non-null  int64  
 15  total_jobs                    369329 non-null  float64
 16  business_size_numeric         369329 non-null  int64  
 17  clue_small_area_numeric       369329 non-null  int64  
 18  clue_industry_numeric         369329 non-null  int64  
dtypes: float64(3), int64(9), object(7)
memory usage: 56.4+ MB
In [56]:
# correlation plot

# Define the features and target variable
features = ['census_year', 'longitude', 'latitude', 'clue_small_area_numeric',
            'industry_anzsic4_code', 'total_establishments', 'clue_industry_numeric']
target = 'business_size_numeric'

# Create a new DataFrame with the selected features and target variable
correlation_df = cleaned_df[features + [target]]

# Calculate the correlation matrix
correlation_matrix = correlation_df.corr()

# Optionally, visualize the correlation matrix with a heatmap
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm', square=True)
plt.title("Correlation Matrix")
plt.show()
In [57]:
#  the features and target variable
features = ['longitude', 'latitude',
            'industry_anzsic4_code', 'total_establishments']
target = 'business_size_numeric'
In [58]:
# Create a StandardScaler instance
scaler = StandardScaler()

# Fit the scaler to the features and transform them
normalized_features = scaler.fit_transform(cleaned_df[features])

# Create a new DataFrame with normalized features
normalized_df = pd.DataFrame(normalized_features, columns=features)

# Display the first few rows of the normalized DataFrame
print(normalized_df.head())
   longitude  latitude  industry_anzsic4_code  total_establishments
0   0.466506  0.681440               0.620677               2.05035
1   0.466506  0.681440              -0.218055               2.05035
2   0.475255  0.679954              -0.217711               2.05035
3   0.422856  0.724366               1.016831               2.05035
4   0.424256  0.736360              -1.769314               2.05035
In [59]:
# Split the data into training and testing sets 70%, 30%
X_train, X_test, y_train, y_test = train_test_split(normalized_df[features], cleaned_df[target], test_size=0.3, random_state=42)
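One caveat worth flagging with the cell above: the StandardScaler was fitted on the full dataset before splitting, so test-set statistics leak into the preprocessing. A leak-free sketch using synthetic data and a scikit-learn Pipeline (hypothetical variables, not the notebook's):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the feature matrix and target
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# The pipeline fits the scaler on the training fold only,
# so no test-set statistics influence the preprocessing
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)
score = clf.score(X_te, y_te)
print(f"held-out accuracy: {score:.3f}")
```

With z-score scaling the leak is usually mild, but the pipeline pattern makes the correct fit/transform order automatic, including inside cross-validation.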
In [60]:
# Create and train the Random Forest classifier
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf_classifier.predict(X_test)

# Compute the classification report
class_summary = classification_report(y_test, y_pred, target_names=['Non employing', 'Small business', 'Medium business', 'Large business'], digits=4)

# Compute the confusion matrix
confusion_mat = confusion_matrix(y_test, y_pred)

# Display the results
print("Classification Report:\n", class_summary)
print("Confusion Matrix:\n", confusion_mat)

# Optionally, visualize the confusion matrix with a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_mat, annot=True, fmt='d', cmap='Blues', xticklabels=['Non employing', 'Small business', 'Medium business', 'Large business'], yticklabels=['Non employing', 'Small business', 'Medium business', 'Large business'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion Matrix')
plt.show()
Classification Report:
                  precision    recall  f1-score   support

  Non employing     1.0000    1.0000    1.0000      5879
 Small business     1.0000    1.0000    1.0000      3627
Medium business     1.0000    1.0000    1.0000     24968
 Large business     1.0000    1.0000    1.0000     76325

       accuracy                         1.0000    110799
      macro avg     1.0000    1.0000    1.0000    110799
   weighted avg     1.0000    1.0000    1.0000    110799

Confusion Matrix:
 [[ 5879     0     0     0]
 [    0  3627     0     0]
 [    0     0 24968     0]
 [    0     0     0 76325]]
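A perfect held-out score like this usually signals leakage rather than a flawless model, plausibly because total_establishments is a block-level aggregate closely aligned with the size class. One quick diagnostic is to inspect feature importances; a self-contained sketch with a deliberately leaky synthetic feature (all names hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = rng.integers(0, 4, size=2000)               # 4 classes, like business_size
noise = rng.normal(size=(2000, 3))              # genuinely uninformative features
leaky = y + rng.normal(scale=0.01, size=2000)   # near-deterministic function of y
X = np.column_stack([noise, leaky])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

print("test accuracy:", rf.score(X_te, y_te))
print("feature importances:", rf.feature_importances_.round(3))
# The leaky column dominates the importances, flagging the problem
```

If one feature absorbs nearly all the importance and accuracy is suspiciously perfect, it is worth asking whether that feature would actually be known at prediction time.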
In [61]:
# Create and train the Logistic Regression classifier
log_reg_classifier = LogisticRegression(
    C=0.5,  # Adjust regularization strength
    penalty='l2',  # Use L2 regularization
    solver='lbfgs',  # Use 'lbfgs' solver
    max_iter=1000,  # Keep the maximum iterations
    class_weight='balanced',  # Adjust weights for imbalanced classes
    tol=1e-4,  # Set a tolerance for convergence
    multi_class='multinomial'  # Use softmax for multi-class classification
)
log_reg_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred_log_reg = log_reg_classifier.predict(X_test)

# Compute the classification report
class_summary_log_reg = classification_report(y_test, y_pred_log_reg,
                                             target_names=['Non employing', 'Small business', 'Medium business', 'Large business'],
                                             digits=4)

# Compute the confusion matrix
confusion_mat_log_reg = confusion_matrix(y_test, y_pred_log_reg)

# Display the results
print("Classification Report for Logistic Regression:\n", class_summary_log_reg)
print("Confusion Matrix for Logistic Regression:\n", confusion_mat_log_reg)

# Optionally, visualize the confusion matrix with a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_mat_log_reg, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Non employing', 'Small business', 'Medium business', 'Large business'],
            yticklabels=['Non employing', 'Small business', 'Medium business', 'Large business'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion Matrix for Logistic Regression')
plt.show()
Classification Report for Logistic Regression:
                  precision    recall  f1-score   support

  Non employing     0.8134    1.0000    0.8971      5879
 Small business     0.6284    0.9810    0.7661      3627
Medium business     0.7441    0.8345    0.7867     24968
 Large business     0.9695    0.8880    0.9270     76325

       accuracy                         0.8849    110799
      macro avg     0.7889    0.9259    0.8442    110799
   weighted avg     0.8993    0.8849    0.8885    110799

Confusion Matrix for Logistic Regression:
 [[ 5879     0     0     0]
 [    0  3558     8    61]
 [    0  2062 20837  2069]
 [ 1349    42  7157 67777]]

Neural Network Multiclass Classification Model¶

In [62]:
num_classes = len(y_train.unique())

model = Sequential([
    Dense(128, activation='relu', input_shape=(4,)),
    Dense(64, activation='relu'),
    Dense(num_classes, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(X_train, y_train, epochs=20, batch_size=32, validation_split=0.2)

test_loss, test_acc = model.evaluate(X_test, y_test)
print("Test Accuracy ", test_acc)
Epoch 1/20
6464/6464 [==============================] - 5s 760us/step - loss: 0.0208 - accuracy: 0.9948 - val_loss: 0.0030 - val_accuracy: 0.9983
Epoch 2/20
6464/6464 [==============================] - 5s 714us/step - loss: 0.0026 - accuracy: 0.9991 - val_loss: 0.0013 - val_accuracy: 0.9996
Epoch 3/20
6464/6464 [==============================] - 5s 716us/step - loss: 0.0016 - accuracy: 0.9995 - val_loss: 5.9158e-04 - val_accuracy: 0.9999
Epoch 4/20
6464/6464 [==============================] - 5s 739us/step - loss: 0.0011 - accuracy: 0.9997 - val_loss: 4.7069e-04 - val_accuracy: 0.9999
Epoch 5/20
6464/6464 [==============================] - 5s 757us/step - loss: 6.7552e-04 - accuracy: 0.9998 - val_loss: 2.7480e-04 - val_accuracy: 0.9999
Epoch 6/20
6464/6464 [==============================] - 5s 755us/step - loss: 4.0970e-04 - accuracy: 0.9999 - val_loss: 2.5435e-04 - val_accuracy: 0.9999
Epoch 7/20
6464/6464 [==============================] - 5s 747us/step - loss: 5.3689e-04 - accuracy: 0.9999 - val_loss: 3.5930e-04 - val_accuracy: 0.9999
Epoch 8/20
6464/6464 [==============================] - 5s 767us/step - loss: 3.1196e-04 - accuracy: 0.9999 - val_loss: 7.4120e-04 - val_accuracy: 0.9996
Epoch 9/20
6464/6464 [==============================] - 5s 819us/step - loss: 4.6061e-04 - accuracy: 0.9999 - val_loss: 6.6086e-04 - val_accuracy: 0.9998
Epoch 10/20
6464/6464 [==============================] - 5s 772us/step - loss: 1.8890e-04 - accuracy: 0.9999 - val_loss: 1.1298e-05 - val_accuracy: 1.0000
Epoch 11/20
6464/6464 [==============================] - 5s 732us/step - loss: 4.6118e-04 - accuracy: 0.9999 - val_loss: 2.5336e-05 - val_accuracy: 1.0000
Epoch 12/20
6464/6464 [==============================] - 5s 721us/step - loss: 3.1193e-04 - accuracy: 0.9999 - val_loss: 1.1105e-04 - val_accuracy: 1.0000
Epoch 13/20
6464/6464 [==============================] - 5s 728us/step - loss: 3.8306e-04 - accuracy: 0.9999 - val_loss: 2.2264e-04 - val_accuracy: 0.9999
Epoch 14/20
6464/6464 [==============================] - 5s 734us/step - loss: 1.0564e-04 - accuracy: 1.0000 - val_loss: 1.6396e-05 - val_accuracy: 1.0000
Epoch 15/20
6464/6464 [==============================] - 5s 726us/step - loss: 3.6668e-04 - accuracy: 0.9999 - val_loss: 3.0403e-04 - val_accuracy: 0.9999
Epoch 16/20
6464/6464 [==============================] - 5s 728us/step - loss: 2.3317e-04 - accuracy: 0.9999 - val_loss: 4.3953e-05 - val_accuracy: 1.0000
Epoch 17/20
6464/6464 [==============================] - 5s 729us/step - loss: 1.5432e-04 - accuracy: 0.9999 - val_loss: 1.2579e-05 - val_accuracy: 1.0000
Epoch 18/20
6464/6464 [==============================] - 5s 745us/step - loss: 2.7530e-04 - accuracy: 0.9999 - val_loss: 0.0039 - val_accuracy: 0.9993
Epoch 19/20
6464/6464 [==============================] - 5s 727us/step - loss: 5.3759e-04 - accuracy: 0.9999 - val_loss: 1.2479e-05 - val_accuracy: 1.0000
Epoch 20/20
6464/6464 [==============================] - 5s 720us/step - loss: 1.2971e-04 - accuracy: 1.0000 - val_loss: 5.3994e-05 - val_accuracy: 0.9999
3463/3463 [==============================] - 2s 505us/step - loss: 2.5325e-05 - accuracy: 1.0000
Test Accuracy  0.9999819397926331
In [63]:
predictions = model.predict(X_test)
predicted_classes = predictions.argmax(axis=1)
In [64]:
history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']

epochs = range(1, len(loss_values) + 1)

plt.plot(epochs, loss_values, 'bo', label='Training loss')           
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')      
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

Regression Modelling¶

In this phase of the analysis, we will apply regression techniques to model the relationships between the features and a continuous target variable. Regression analysis is valuable for understanding how the independent variables influence the dependent variable and for making predictions based on this understanding. We will utilize a suitable regression model, such as Linear Regression, to assess the influence of various features on the target variable, which could be the number of jobs or total establishments. The model will be evaluated using metrics such as R² score, Mean Absolute Error (MAE), and Mean Squared Error (MSE) to quantify its performance. Through this analysis, we aim to uncover insights into the dynamics of business establishments and employment within the dataset.
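The evaluation metrics named here have compact definitions worth keeping in mind; a minimal check that hand-computed MSE, MAE, and R² agree with scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

mse = np.mean((y_true - y_pred) ** 2)           # penalises large errors quadratically
mae = np.mean(np.abs(y_true - y_pred))          # average absolute error, same units as y
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                        # fraction of variance explained

assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))
print(round(mse, 4), round(mae, 4), round(r2, 4))
```

MAE stays in the units of the target (jobs here), which makes it the easiest of the three to interpret when reading the results below.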

In [65]:
# Define the target variable (e.g., total_jobs) and features
target = 'total_jobs'
features = ['census_year', 'longitude', 'latitude', 'clue_small_area_numeric',
            'industry_anzsic4_code', 'total_establishments', 'clue_industry_numeric']

# Split the data into training and testing sets
X = cleaned_df[features]
y = cleaned_df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the Linear Regression model
linear_regressor = LinearRegression()
linear_regressor.fit(X_train, y_train)

# Make predictions on the test set
y_pred = linear_regressor.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)


# Display the evaluation metrics
print("Mean Squared Error (MSE):", mse)
print("Mean Absolute Error (MAE):", mae)
print("R² Score:", r2)

# Optionally, visualize the predictions vs. actual values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.7)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs. Predicted Values')
plt.show()
Mean Squared Error (MSE): 130923.17767442112
Mean Absolute Error (MAE): 262.99255646703307
R² Score: 0.6645396958258626
In [66]:
# Create and train the Random Forest Regressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)  # You can adjust n_estimators
rf_regressor.fit(X_train, y_train)

# Make predictions on the test set
y_pred_rf = rf_regressor.predict(X_test)

# Evaluate the model
mse_rf = mean_squared_error(y_test, y_pred_rf)
mae_rf = mean_absolute_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

# Display the evaluation metrics
print("Random Forest Regressor Evaluation Metrics:")
print("Mean Squared Error (MSE):", mse_rf)
print("Mean Absolute Error (MAE):", mae_rf)
print("R² Score:", r2_rf)

# Optionally, visualize the predictions vs. actual values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_rf, alpha=0.7)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Random Forest: Actual vs. Predicted Values')
plt.show()
Random Forest Regressor Evaluation Metrics:
Mean Squared Error (MSE): 0.0
Mean Absolute Error (MAE): 0.0
R² Score: 1.0
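A perfect R² again suggests leakage rather than genuine predictive power: total_jobs is a block-level aggregate repeated across many rows, so near-duplicates of each target value land in both train and test. One hedged remedy is a group-aware split, sketched here on synthetic data with block-like groups:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(1)
n_groups = 50
groups = np.repeat(np.arange(n_groups), 20)      # e.g. 20 rows per block_id
X = rng.normal(size=(len(groups), 3))
y = np.repeat(rng.normal(size=n_groups), 20)     # target shared within a group

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=1)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

# No group appears on both sides of the split, so a model cannot
# simply memorise the group-level target value
shared = set(groups[train_idx]) & set(groups[test_idx])
print("groups shared between train and test:", len(shared))
```

Applied to this dataset, grouping by block_id (or property_id) would give a far more honest estimate of out-of-sample error than a row-level random split.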

Spatial Analysis¶

Our spatial analysis employs a comprehensive set of techniques to analyze and visualize geographical data, focusing on address distribution patterns. We begin with exploratory visualization, using heatmaps and 3D scatter plots to understand the overall density and distribution of addresses. To identify clusters, we apply DBSCAN with convex hulls, providing insight into how addresses group spatially. We then use spatial joins to relate the data points to specific landmarks, giving context to the geographical layout. For a more nuanced view of density, we employ Kernel Density Estimation (KDE) plots and hexbin plots, and to assess spatial randomness we conduct a Ripley's K function analysis. Together, these perspectives provide a robust framework for understanding the geographical patterns, clusters, and distributions within the address data.
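The DBSCAN-with-convex-hulls step mentioned above can be sketched as follows; the coordinates and parameters below are synthetic and illustrative, not the notebook's data:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from scipy.spatial import ConvexHull

# Two synthetic point clouds standing in for address coordinates
rng = np.random.default_rng(7)
cluster_a = rng.normal(loc=(144.96, -37.81), scale=0.002, size=(200, 2))
cluster_b = rng.normal(loc=(144.99, -37.80), scale=0.002, size=(200, 2))
points = np.vstack([cluster_a, cluster_b])

# eps is in degrees here; real analyses usually project to metres first
labels = DBSCAN(eps=0.005, min_samples=10).fit_predict(points)

for label in sorted(set(labels) - {-1}):         # -1 marks noise points
    hull = ConvexHull(points[labels == label])   # hull.volume is area in 2D
    print(f"cluster {label}: {np.sum(labels == label)} points, "
          f"hull area ~{hull.volume:.2e} sq. degrees")
```

The convex hull of each cluster gives a simple footprint polygon that can be overlaid on a map to show where each grouping of addresses sits.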

In [67]:
# Convert to GeoDataFrame
gdf = gpd.GeoDataFrame(
    street_add,
    geometry=gpd.points_from_xy(street_add.longitude, street_add.latitude),
    crs="EPSG:4326"
)

The heatmap visualization provides a clear and striking insight into the geographical distribution of our data points. The map reveals a highly concentrated cluster of addresses centered around Melbourne, Australia. This intense concentration, depicted by the vibrant red and yellow colors, indicates a significant density of addresses in the Melbourne metropolitan area. The heatmap's circular pattern, transitioning from red at the center to cooler colors (orange, yellow, green, and blue) at the edges, suggests a gradual decrease in address density as we move away from the city center. This pattern is characteristic of a major urban area, with the highest concentration of addresses in the central business district and inner suburbs, and a progressive thinning out towards the outskirts. The stark contrast between the dense cluster and the surrounding areas highlights Melbourne's role as a major population center in the region, standing out prominently against the less populated areas of Victoria state.

In [68]:
#  Heatmap
heat_data = gdf[['latitude', 'longitude']].values.tolist()  # vectorised; avoids slow iterrows
heat_map = folium.Map(location=[gdf.latitude.mean(), gdf.longitude.mean()], zoom_start=12)
HeatMap(heat_data).add_to(heat_map)
display(heat_map)

The 3D scatter plot and spatial join analysis provide rich insights into the spatial distribution of addresses in Melbourne. The plot visualizes the city's suburbs as distinct clusters, each represented by a unique color, showcasing the geographical layout and relative positions of areas like North Melbourne, Southbank, and Docklands. The z-axis encodes suburb IDs; these are arbitrary identifiers, so the vertical separation serves only to distinguish the suburb clusters visually rather than to indicate elevation. Notably, the spatial join analysis reveals a significant concentration of addresses within the city center, with 63,721 points located within a 5km radius of Melbourne's CBD. This high density underscores the centralized nature of Melbourne's urban development, highlighting the CBD as a focal point of residential and commercial activity. Together, the visualization and the quantitative result paint a picture of a densely populated, well-defined urban core surrounded by distinct suburban areas, each with its own spatial characteristics and relationship to the city center.

In [69]:
# 3D scatter plot
fig = px.scatter_3d(gdf, x='longitude', y='latitude', z='suburb_id', color='suburb',
                    hover_data=['address_pnt'])
fig.show()

# Spatial join (example: finding points within a certain distance of a landmark)
landmark = gpd.GeoDataFrame({'geometry': [Point(144.9631, -37.8136)]}, crs="EPSG:4326")  # Melbourne CBD
nearby_points = gpd.sjoin_nearest(gdf, landmark, max_distance=5000)  # Points within 5km
print(f"Number of points within 5km of Melbourne CBD: {len(nearby_points)}")
Number of points within 5km of Melbourne CBD: 63721

The Kernel Density Estimation (KDE) plots for business distribution in 2010, 2015, and 2020 reveal a concentration of businesses within a specific geographic region, as indicated by the high-density areas (red zones) in the plots. Between these years, the core area of business activity appears consistent, with only slight changes in density and spatial spread. The KDE plots demonstrate that while the overall distribution of businesses has remained stable, there may be a subtle increase in density within the central region, indicating potential business growth or clustering in this area over the decade. The KDE method effectively highlights the geographic concentration of businesses and any temporal changes in their distribution.

In [70]:
# Convert to GeoDataFrame
gdf = gpd.GeoDataFrame(cleaned_df, geometry=gpd.points_from_xy(cleaned_df['longitude'], cleaned_df['latitude']), crs="EPSG:4326")

# Filter data for specific years (for example, 2010, 2015, 2020)
years_to_analyze = [2010, 2015, 2020]
gdf_filtered = gdf[gdf['census_year'].isin(years_to_analyze)]

# Plot KDE for each year
for year in years_to_analyze:
    fig, ax = plt.subplots(figsize=(12, 8))
    subset = gdf_filtered[gdf_filtered['census_year'] == year]

    # Reproject to Web Mercator for contextily
    subset = subset.to_crs(epsg=3857)

    # Extract x and y coordinates from the geometry
    subset['x'] = subset.geometry.x
    subset['y'] = subset.geometry.y

    # Plot KDE
    sns.kdeplot(data=subset, x='x', y='y', cmap='YlOrRd', fill=True, cbar=True, ax=ax)  # 'shade' was renamed 'fill' in seaborn 0.11+

    # Add basemap
    ctx.add_basemap(ax, source=ctx.providers.CartoDB.Positron)

    # Set extent to match the data
    ax.set_xlim(subset.total_bounds[[0, 2]])
    ax.set_ylim(subset.total_bounds[[1, 3]])

    # Remove axis labels as they're not meaningful in this projection
    ax.set_axis_off()

    plt.title(f'Kernel Density Estimation of Business Distribution in {year}')
    plt.tight_layout()
    plt.show()

The Ripley's K function analysis graph compares the observed spatial distribution of addresses (blue line) to the expected distribution under Complete Spatial Randomness (CSR, orange line). Under the standard interpretation, observed K values above the CSR curve indicate clustering at that distance, while values below it indicate dispersion. The plot shows a pronounced and persistent deviation of the observed curve from the CSR expectation across the full range of distances examined, so the address pattern is clearly non-random. This is consistent with typical urban development, where buildings and addresses are concentrated in specific areas rather than spread evenly across the landscape.

In [71]:
#  Ripley's K Function Analysis
def ripley_k(points, distances, area):
    n = len(points)
    k_values = []
    pairwise = distance.cdist(points, points)  # compute once, reuse per distance
    for d in distances:
        # Count ordered pairs within distance d, excluding the n self-pairs
        # on the diagonal (each point is at distance 0 from itself)
        pairs_within = np.sum(pairwise < d) - n
        k = (area * pairs_within) / (n * (n - 1))
        k_values.append(k)
    return k_values

# Sample 1000 points for faster computation
sample_size = min(1000, len(gdf))
sample_gdf = gdf.sample(sample_size)

points = np.column_stack((sample_gdf.geometry.x, sample_gdf.geometry.y))
# Approximate the study area as the bounding-box extent (width x height),
# not the product of the raw max coordinates
minx, miny, maxx, maxy = gdf.total_bounds
area = (maxx - minx) * (maxy - miny)
distances = np.linspace(0, 0.1, 50)  # adjust based on your data's scale

k_observed = ripley_k(points, distances, area)
k_expected = np.pi * distances**2

plt.figure(figsize=(10, 6))
plt.plot(distances, k_observed, label='Observed K')
plt.plot(distances, k_expected, label='Expected K (CSR)')
plt.xlabel('Distance')
plt.ylabel("Ripley's K")
plt.title("Ripley's K Function Analysis")
plt.legend()
plt.show()

Time Series Analysis¶

In this time series analysis, our goal is to model and forecast the total number of jobs from 2003 to 2021 using various statistical and machine learning techniques. By leveraging this historical job data, we aim to uncover patterns, trends, and any seasonal behaviors that can help us better understand the employment dynamics over the years. We will explore three key models: ARIMA (AutoRegressive Integrated Moving Average), Holt-Winters Exponential Smoothing, and the Vector Autoregression (VAR) model. Each model takes a different approach to time series forecasting, allowing us to compare their effectiveness in predicting future trends.

The purpose of this analysis is to evaluate which model provides the best fit to our historical data and make accurate predictions for the next five years, covering 2022 to 2026. We will assess the performance of each model using evaluation metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared. By generating forecasts based on the fitted models, we aim to provide valuable insights that can inform economic planning and resource allocation, offering a data-driven outlook on future employment trends. This comparative analysis will help us identify the most reliable model for predicting job growth in the coming years.

In [72]:
cleaned_df['census_year'].value_counts()
Out[72]:
2021    19912
2020    19748
2019    19725
2018    19722
2022    19660
2017    19570
2015    18999
2014    18908
2012    18770
2011    18677
2013    18553
2010    18300
2009    17991
2016    17715
2008    17400
2007    15788
2006    14891
2005    14257
2004    13750
2002    13497
2003    13496
Name: census_year, dtype: int64
In [73]:
cleaned_df.info() # this is the cleaned dataset
<class 'pandas.core.frame.DataFrame'>
Int64Index: 369329 entries, 0 to 369328
Data columns (total 19 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   census_year                   369329 non-null  int64  
 1   block_id                      369329 non-null  int64  
 2   property_id                   369329 non-null  int64  
 3   base_property_id              369329 non-null  int64  
 4   clue_small_area               369329 non-null  object 
 5   trading_name                  369329 non-null  object 
 6   business_address              369329 non-null  object 
 7   industry_anzsic4_code         369329 non-null  int64  
 8   industry_anzsic4_description  369329 non-null  object 
 9   longitude                     369329 non-null  float64
 10  latitude                      369329 non-null  float64
 11  anzsic_indusrty               369329 non-null  object 
 12  clue_industry                 369329 non-null  object 
 13  business_size                 369329 non-null  object 
 14  total_establishments          369329 non-null  int64  
 15  total_jobs                    369329 non-null  float64
 16  business_size_numeric         369329 non-null  int64  
 17  clue_small_area_numeric       369329 non-null  int64  
 18  clue_industry_numeric         369329 non-null  int64  
dtypes: float64(3), int64(9), object(7)
memory usage: 56.4+ MB

Here we plot the actual series to examine its trends. The plot shows the total number of jobs from 2003 to 2021, highlighting several key trends. After a period of slow growth from 2003 to 2005, there is a steady rise in total jobs until around 2012, marking a strong phase of job expansion. This is followed by a period of fluctuation, with a slight decline around 2013-2016 and a more significant drop in 2017-2018. The labor market then recovers, showing stabilization from 2019 to 2021, though with a minor decline towards the end of the period, possibly indicating early signs of economic challenges as 2022 approaches.

In [74]:
# Load and prepare data

yearly_data = cleaned_df.groupby('census_year')['total_jobs'].sum().reset_index()
yearly_data.set_index('census_year', inplace=True)

# Add some features for VAR model
yearly_data['trend'] = np.arange(len(yearly_data))
yearly_data['total_jobs_diff'] = yearly_data['total_jobs'].diff()

# Plot the time series
plt.figure(figsize=(15, 7))
plt.plot(yearly_data.index, yearly_data['total_jobs'])
plt.title('Total Jobs Over Time')
plt.xlabel('Year')
plt.ylabel('Total Jobs')
plt.xticks(yearly_data.index, rotation=45)
plt.tight_layout()
plt.show()

The results of the Augmented Dickey-Fuller (ADF) test, printed below, speak to the stationarity of the original and differenced job series. For the original data, the ADF statistic is -3.99 with a p-value of about 0.0015, well below the 0.05 significance level; the statistic also lies below the 5% and 10% critical values (though not the 1% value). We can therefore reject the null hypothesis of non-stationarity at the 5% level, which implies the original series already appears stationary, somewhat unusual for raw time series data.

For the differenced data, however, the ADF statistic is 18.59 with a p-value of 1.0, so the test fails to reject the null hypothesis of a unit root. This counter-intuitive result (differencing a stationary series should not make it non-stationary) is most likely an artifact of the very short series, around 20 annual observations, which leaves the test with little power. Given that the original series already passes the test, differencing may not be strictly necessary here.

In [75]:
# Check for stationarity
def test_stationarity(timeseries):
    result = adfuller(timeseries, autolag='AIC')
    print('ADF Statistic:', result[0])
    print('p-value:', result[1])
    print('Critical Values:', result[4])

print("Stationarity Test for Original Data:")
test_stationarity(yearly_data['total_jobs'])

# If data is not stationary, difference it
yearly_data['total_jobs_diff'] = yearly_data['total_jobs'].diff()
yearly_data = yearly_data.dropna()

print("\nStationarity Test for Differenced Data:")
test_stationarity(yearly_data['total_jobs_diff'])
Stationarity Test for Original Data:
ADF Statistic: -3.987381506113443
p-value: 0.001478045945905543
Critical Values: {'1%': -4.137829282407408, '5%': -3.1549724074074077, '10%': -2.7144769444444443}

Stationarity Test for Differenced Data:
ADF Statistic: 18.585124912652123
p-value: 1.0
Critical Values: {'1%': -4.223238279489106, '5%': -3.189368925619835, '10%': -2.729839421487603}

In this time series analysis of total jobs from 2003 to 2021, the ACF and PACF plots provide insight into the underlying patterns in the data. The ACF plot shows a gradual decline in autocorrelation across increasing lags, suggesting that the data may exhibit non-stationarity or an autoregressive structure with persistence up to around 4 lags. The PACF plot, on the other hand, exhibits a sharp cutoff after lag 1, indicating that an AR(1) model might capture most of the dependencies in the data. This pattern suggests that the data could be well-represented by a low-order autoregressive model, and further differencing or model refinement may be needed to address any potential non-stationarity. These insights will inform the choice of models such as ARIMA in the forecasting process.

In [76]:
# Calculate and plot ACF
acf_values = acf(yearly_data['total_jobs'])
plt.figure(figsize=(12, 6))
plt.bar(range(len(acf_values)), acf_values)
plt.title('Autocorrelation Function')
plt.xlabel('Lag')
plt.ylabel('ACF')
plt.show()

# Calculate and plot PACF
pacf_values = pacf(yearly_data['total_jobs'])
plt.figure(figsize=(12, 6))
plt.bar(range(len(pacf_values)), pacf_values)
plt.title('Partial Autocorrelation Function')
plt.xlabel('Lag')
plt.ylabel('PACF')
plt.show()

ARIMA Model Implementation

The ARIMA(0, 2, 0) model, selected for its lowest AIC of 528.808, is a very simple specification: two levels of differencing and no autoregressive or moving-average components. The large sigma-squared estimate (2.69e+11) indicates high variance in the residuals. The Ljung-Box test (Q = 3.33, p = 0.07) does not reject the hypothesis of uncorrelated residuals at the 5% level, though the borderline p-value hints at some remaining autocorrelation. Likewise, the Jarque-Bera test (JB = 4.04, p = 0.13) does not formally reject normality, but the positive skew (0.90) and elevated kurtosis (4.46) suggest heavier-than-normal tails, leaving room for model refinement.

In [77]:
# Function to evaluate models
def evaluate_model(actual, predicted):
    mse = mean_squared_error(actual, predicted)
    rmse = sqrt(mse)
    mae = mean_absolute_error(actual, predicted)
    r2 = r2_score(actual, predicted)
    return {'MSE': mse, 'RMSE': rmse, 'MAE': mae, 'R-squared': r2}

# Auto ARIMA to find best parameters
auto_arima_model = auto_arima(yearly_data['total_jobs'], start_p=1, start_q=1,
                              test='adf', max_p=5, max_q=5, m=1,
                              d=None, seasonal=False, start_P=0,
                              D=0, trace=True, error_action='ignore',
                              suppress_warnings=True, stepwise=True)
print(f"Best ARIMA order: {auto_arima_model.order}")

# ARIMA model with best parameters
arima_model = ARIMA(yearly_data['total_jobs'], order=auto_arima_model.order)
arima_results = arima_model.fit()
print(arima_results.summary())
Performing stepwise search to minimize aic
 ARIMA(1,2,1)(0,0,0)[0] intercept   : AIC=535.194, Time=0.06 sec
 ARIMA(0,2,0)(0,0,0)[0] intercept   : AIC=530.580, Time=0.00 sec
 ARIMA(1,2,0)(0,0,0)[0] intercept   : AIC=532.122, Time=0.01 sec
 ARIMA(0,2,1)(0,0,0)[0] intercept   : AIC=533.141, Time=0.01 sec
 ARIMA(0,2,0)(0,0,0)[0]             : AIC=528.808, Time=0.00 sec

Best model:  ARIMA(0,2,0)(0,0,0)[0]          
Total fit time: 0.092 seconds
Best ARIMA order: (0, 2, 0)
                               SARIMAX Results                                
==============================================================================
Dep. Variable:             total_jobs   No. Observations:                   20
Model:                 ARIMA(0, 2, 0)   Log Likelihood                -263.404
Date:                Wed, 18 Sep 2024   AIC                            528.808
Time:                        01:01:15   BIC                            529.699
Sample:                             0   HQIC                           528.931
                                 - 20                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
sigma2       2.69e+11   6.43e+10      4.187      0.000    1.43e+11    3.95e+11
===================================================================================
Ljung-Box (L1) (Q):                   3.33   Jarque-Bera (JB):                 4.04
Prob(Q):                              0.07   Prob(JB):                         0.13
Heteroskedasticity (H):               2.73   Skew:                             0.90
Prob(H) (two-sided):                  0.25   Kurtosis:                         4.46
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

The residual diagnostics of the ARIMA model raise several concerns. The standardized-residual plot (top left) shows non-constant variance with a few extreme spikes, suggesting misfit. The histogram (top right) shows some skewness and deviation from the overlaid normal density, while the Q-Q plot (bottom left) reveals that residuals depart from the theoretical normal line, especially at the extremes. Lastly, the correlogram (bottom right) shows some residual autocorrelation, meaning the ARIMA model does not capture all of the time-dependent structure in the data. Overall, the ARIMA model does not fit this dataset adequately.

In [78]:
arima_results.plot_diagnostics(figsize=(15, 12))
plt.show()

Holt-Winters Exponential Smoothing Model

The Holt-Winters exponential smoothing model with an additive trend was fitted to the job data, yielding an AIC of 527.463, marginally better than the ARIMA model's 528.808. The optimized smoothing level (alpha = 0.995) means the level tracks the most recent observation almost exactly, while the smoothing trend (beta = 0.284) updates the trend component more gradually. The initial level (1.24e+07) and initial trend (6.25e+05) give the starting point and growth rate of the series. Overall, this model captures both the level and the trend of the job data, but its accuracy still needs to be compared against the other models.
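To see what the optimized alpha and beta from the summary table actually do, here is a hand-rolled sketch of Holt's linear-trend recursion on a tiny made-up series (the function and data are illustrative, not the notebook's fit): with alpha near 1 the level essentially tracks the latest observation, while a moderate beta smooths the trend estimate.

```python
import numpy as np

def holt_linear(y, alpha, beta):
    """One pass of Holt's linear-trend recursion, returning the final level and trend."""
    level, trend = y[0], y[1] - y[0]  # simple initialization
    for t in range(1, len(y)):
        prev_level = level
        # Level: weighted mix of the new observation and the previous one-step forecast
        level = alpha * y[t] + (1 - alpha) * (prev_level + trend)
        # Trend: weighted mix of the latest level change and the old trend
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level, trend

y = np.array([10.0, 12.0, 15.0, 19.0, 24.0])
level, trend = holt_linear(y, alpha=0.995, beta=0.28)
# With alpha = 0.995 the final level sits almost exactly on the last observation (24),
# while the trend estimate blends recent and older level changes.
print(level, trend)
```

The one-step-ahead forecast is then simply `level + trend`, and an h-step forecast is `level + h * trend`, which is why this model extrapolates a straight trend line.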

In [79]:
# Holt-Winters model
hw_model = ExponentialSmoothing(yearly_data['total_jobs'], trend='add', seasonal=None)
hw_results = hw_model.fit()
print(hw_results.summary())
                       ExponentialSmoothing Model Results                       
================================================================================
Dep. Variable:               total_jobs   No. Observations:                   20
Model:             ExponentialSmoothing   SSE                  3810976327383.089
Optimized:                         True   AIC                            527.463
Trend:                         Additive   BIC                            531.446
Seasonal:                          None   AICC                           533.925
Seasonal Periods:                  None   Date:                 Wed, 18 Sep 2024
Box-Cox:                          False   Time:                         01:01:15
Box-Cox Coeff.:                    None                                         
==============================================================================
                       coeff                 code              optimized      
------------------------------------------------------------------------------
smoothing_level            0.9950000                alpha                 True
smoothing_trend            0.2842857                 beta                 True
initial_level             1.2433e+07                  l.0                 True
initial_trend             6.2522e+05                  b.0                 True
------------------------------------------------------------------------------

VAR Model Implementation

A VAR (Vector Autoregression) model is applied next and its results are evaluated below. One caveat worth noting: the three inputs (total jobs, a linear trend, and the differenced jobs series) are all derived from the same target, so the VAR's strong in-sample fit partly reflects that overlap rather than genuinely independent predictors.

In [80]:
# VAR model
var_data = yearly_data[['total_jobs', 'trend', 'total_jobs_diff']].dropna()
var_model = VAR(var_data)

Model Evaluation

The evaluation results of the three forecasting models—VAR, ARIMA, and Holt-Winters—reveal important insights into their performance on the job data.

The VAR (Vector Autoregression) model shows the strongest in-sample performance overall, with a mean squared error (MSE) of about 1.28e+11, a root mean squared error (RMSE) of roughly 357,000 jobs, and a mean absolute error (MAE) of about 261,000. Most notably, it has an R-squared of 0.956, indicating that the model explains about 96% of the variance in the data. This suggests that VAR captures the underlying structure of the job trends very effectively.

The ARIMA model, on the other hand, performs poorly, with a far higher MSE of about 1.63e+13 and an RMSE of roughly 4.04 million. The MAE is also much larger at about 1.63 million, and the R-squared is strongly negative (-3.36), meaning the model fits the data worse than simply predicting the mean would. This suggests that ARIMA struggles to capture the dynamics of this particular time series, potentially because the double differencing selected by the automated search is too aggressive for such a short series.

The Holt-Winters exponential smoothing model also performs well, with an MSE of about 1.91e+11 and an RMSE of roughly 437,000. Its MAE of about 355,000 is somewhat higher than VAR's, and its R-squared of 0.949 is only marginally below the VAR model's, meaning it too explains about 95% of the data's variance. The Holt-Winters model, which captures the trend component without seasonal adjustments, provides solid fits, though it is slightly less accurate than VAR.

In conclusion, while the ARIMA model is not suitable for this dataset, both the VAR and Holt-Winters models perform well, with VAR slightly ahead of Holt-Winters in accuracy. On this evidence, VAR is the most reliable model for forecasting future job trends from these data.
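The negative R-squared reported for ARIMA is worth unpacking: R-squared compares a model against always predicting the mean of the series, so a constant-mean prediction scores exactly 0 and anything worse goes negative. A tiny illustration with made-up numbers:

```python
import numpy as np
from sklearn.metrics import r2_score

actual = np.array([10.0, 12.0, 14.0, 16.0, 18.0])  # a rising series

# Predicting the mean of the series gives an R-squared of exactly 0 ...
baseline = np.full_like(actual, actual.mean())
r2_baseline = r2_score(actual, baseline)            # 0.0

# ... so a forecast that trends the wrong way scores below zero.
declining = np.array([18.0, 16.0, 14.0, 12.0, 10.0])
r2_declining = r2_score(actual, declining)          # -3.0
print(r2_baseline, r2_declining)
```

This is why a negative R-squared is such a damning result: the model is actively worse than ignoring the time dimension altogether.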

In [81]:
# Calculate the maximum number of lags
n_obs = len(var_data)
n_vars = var_data.shape[1]
max_lags = min(15, int((n_obs - n_vars - 1) / (n_vars * n_vars + n_vars)))

if max_lags > 0:
    var_results = var_model.fit(maxlags=max_lags, ic='aic')
    var_fittedvalues = var_results.fittedvalues['total_jobs']

    # Align the fitted values with the original data
    var_fittedvalues = var_fittedvalues.reindex(var_data.index)
    var_fittedvalues.iloc[:max_lags] = np.nan  # Set initial values to NaN

    # Evaluate only on the non-NaN values
    var_evaluation = evaluate_model(var_data['total_jobs'].iloc[max_lags:], var_fittedvalues.iloc[max_lags:])
    print("\nVAR Model Evaluation:")
    print(var_evaluation)
else:
    print("\nNot enough data for VAR model. Skipping VAR analysis.")

# Evaluate models
arima_evaluation = evaluate_model(yearly_data['total_jobs'], arima_results.fittedvalues)
hw_evaluation = evaluate_model(yearly_data['total_jobs'], hw_results.fittedvalues)

print("\nARIMA Model Evaluation:")
print(arima_evaluation)
print("\nHolt-Winters Model Evaluation:")
print(hw_evaluation)
VAR Model Evaluation:
{'MSE': 127722520263.88837, 'RMSE': 357382.87628800626, 'MAE': 261020.6181255932, 'R-squared': 0.9559655850224672}

ARIMA Model Evaluation:
{'MSE': 16325908112939.863, 'RMSE': 4040533.147115596, 'MAE': 1626990.5999941952, 'R-squared': -3.3638093238933333}

Holt-Winters Model Evaluation:
{'MSE': 190548816369.15448, 'RMSE': 436518.9759554039, 'MAE': 354872.5046318023, 'R-squared': 0.9490675375742501}

Model Predictions for the Next Five Years

The graph compares the performance of three models—ARIMA, Holt-Winters, and VAR—used to forecast total jobs from 2022 to 2026, based on historical data from 2002 to 2021. The VAR model (in red) and Holt-Winters model (in green) show upward trends, with VAR providing a slightly steeper increase in job growth. Both of these models fit well with historical data, as suggested by their performance metrics (such as high R-squared values). However, the ARIMA model (in orange) forecasts a declining trend, indicating poor fit to the data, which is also evident from its evaluation metrics.

Thus, the VAR model appears to provide the most accurate predictions, followed closely by Holt-Winters, while ARIMA performs poorly.

In [82]:
# Forecast next 5 years
future_years = range(yearly_data.index[-1] + 1, yearly_data.index[-1] + 6)
arima_forecast = arima_results.forecast(steps=5)
hw_forecast = hw_results.forecast(5)

if max_lags > 0:
    lag_order = var_results.k_ar
    var_forecast_input = var_data.values[-lag_order:]
    var_forecast = var_results.forecast(y=var_forecast_input, steps=5)
    var_forecast = pd.DataFrame(var_forecast, columns=var_data.columns)['total_jobs']
else:
    var_forecast = None

# Plot forecasts
plt.figure(figsize=(15, 7))
plt.plot(yearly_data.index, yearly_data['total_jobs'], label='Historical Data')
plt.plot(future_years, arima_forecast, label='ARIMA Forecast')
plt.plot(future_years, hw_forecast, label='Holt-Winters Forecast')
if var_forecast is not None:
    plt.plot(future_years, var_forecast, label='VAR Forecast')
plt.title('Total Jobs Forecast')
plt.xlabel('Year')
plt.ylabel('Total Jobs')
plt.legend()
plt.xticks(list(yearly_data.index) + list(future_years), rotation=45)
plt.tight_layout()
plt.show()
In [83]:
print("\nARIMA Forecast for next 5 years:")
print(arima_forecast)
print("\nHolt-Winters Forecast for next 5 years:")
print(hw_forecast)
if var_forecast is not None:
    print("\nVAR Forecast for next 5 years:")
    print(var_forecast)
else:
    print("\nVAR Forecast not available due to insufficient data")
ARIMA Forecast for next 5 years:
20    18575401.0
21    18246531.0
22    17917661.0
23    17588791.0
24    17259921.0
Name: predicted_mean, dtype: float64

Holt-Winters Forecast for next 5 years:
20    1.892948e+07
21    1.895224e+07
22    1.897500e+07
23    1.899775e+07
24    1.902051e+07
dtype: float64

VAR Forecast for next 5 years:
0    1.894285e+07
1    1.905585e+07
2    1.917729e+07
3    1.929376e+07
4    1.940369e+07
Name: total_jobs, dtype: float64

Conclusion

In this time series analysis, we aimed to forecast total job numbers from 2022 to 2026 using historical data from 2002 to 2021, employing three key models: ARIMA, Holt-Winters Exponential Smoothing, and Vector Autoregression (VAR). The goal was to identify patterns, trends, and potential seasonal behaviors to make accurate predictions and offer valuable insights for future economic planning and resource allocation.

Model Performance Summary:

ARIMA Model:

The ARIMA model performed poorly, showing a declining trend in job numbers that contrasts sharply with historical data.

Residual analysis indicated serious issues. The residuals are non-normal, with significant deviations from zero in the correlogram, meaning that the model leaves unexplained autocorrelation. The Q-Q plot shows that residuals deviate from normality, especially at the tails, further suggesting poor model fit.

The ARIMA model is not well suited to this dataset. It fails to capture the dynamics of job trends, possibly because the double differencing chosen by the automated search over-differences such a short series. Its high error metrics and negative R-squared value underline this.

Holt-Winters Exponential Smoothing Model:

The Holt-Winters model provided moderately good forecasts with an upward trend, indicating that it captures the overall job growth. The model does not include a seasonal component, which is appropriate here: the data are annual totals, so there is no within-year seasonality to model.

The Holt-Winters model performed well with low error metrics and a high R-squared value, suggesting it can capture the trend component of the data effectively. However, it is slightly less accurate than the VAR model.

VAR (Vector Autoregression) Model:

The VAR model produced the most reliable forecasts, indicating steady job growth. It slightly outperformed Holt-Winters in terms of capturing the underlying trends and making accurate predictions.

The model fits the historical data closely, and its error metrics (MSE, RMSE, MAE) are the lowest among the models.

The VAR model emerged as the best fit for this dataset. Its ability to draw on several related series (the jobs level, a trend, and the year-on-year change) contributed to its superior in-sample performance. The high R-squared value indicates that the model explains about 96% of the variance in the data, making it the most reliable of the three for forecasting.

Key Insights:

Job Trends: Both the VAR and Holt-Winters models predict sustained job growth from 2022 to 2026. The ARIMA model's prediction of a decline does not align with historical trends, further casting doubt on its suitability.

The VAR model is the most accurate, followed closely by Holt-Winters. ARIMA, with its poor performance metrics, is not recommended for future forecasting of job trends in this context.

Recommendations:

Economic Planning:

Based on the VAR and Holt-Winters forecasts, policymakers can expect continued job growth, which supports optimistic economic planning. However, close monitoring of external factors like economic policies and global conditions is essential to ensure the forecasts remain valid.

Further Improvements:

Future analyses may consider exploring additional variables that could impact job trends, such as inflation, technological advancements, or regional differences. Refining the seasonal components and considering exogenous factors in VAR models could also improve forecast accuracy.

In conclusion, the VAR model stands out as the most reliable for predicting future job trends in this dataset, offering valuable insights for policymakers and stakeholders.